Introduction

In Year 2, we were concerned with estimation techniques and analysis of
breast cancer data from the Utah Population Data Base (UPDB) and the Utah
Cancer Registry (UCR). Technical difficulties associated with estimation of the
hazard function are described at length in our previous report. All these difficulties
have been surmounted and the desired estimates have been obtained from the data
amassed in the UPDB and UCR.

1.  Statement  of  Work 

This annual report covers the following tasks formulated in the statement of
work.

Task 1: Extraction of breast cancer cohort data from the UPDB and UCR.

Task  2:  Development  of  computer  programs  for  estimation  of  family  history  of  breast 
cancer. 

Task  3:  Development  of  software  for  extended  hazard  regression  using  linear,  quadratic 
and  cubic  splines. 

Task  4:  Evaluation  of  family  history  as  a  predictor  of  breast  cancer  on  simulated 
data. 

Task  5:  Extended  hazard  regression  modeling  using  familial  risk  estimates  from  the 
breast  cancer  cohort. 

Task  10:  Preparation  and  mailing  of  the  annual  report. 

Comment:  In  order  to  accommodate  generally-structured  data,  we  have  developed  a 
new  methodological  approach  to  the  problem  of  optimal  breast  cancer  surveillance. 
This  is  the  reason  why  we  began  with  Tasks  7,  8,  and  9  in  year  1.  This  explains  why 
the  present  report  covers  Tasks  1,  2,  3,  4,  and  5  originally  scheduled  for  year  2. 


2.  The  research  carried  out  to  meet  the  objectives 
of  Tasks  1,  2,  3,  4,  5,  10 

2.1.  Introduction 

In  Year  1,  we  developed  several  numerical  algorithms  and  software  for  estimating 
the  hazard  function  for  breast  cancer  incidence.  Allowing  for  the  effects  of  random 
censoring  and  truncation,  these  procedures  have  been  used  for  testing  covariate  effects 
associated  with  different  indicators  of  family  history. 

2.1.  Estimation  of  the  hazard  rate 

Proceeding from preliminary studies of different spline estimation procedures, we
chose to model the hazard function via quadratic splines. A quadratic spline with $m$
knots specifies the hazard to be of the form

$$\lambda_m(t) = \sum_{i=0}^{2} \gamma_{0i}\, t^i + \sum_{j=1}^{m} \gamma_{j2}\, (t - \tau_j)_+^2, \qquad (1)$$

where $(x)_+ = \max(x, 0)$. For each birth cohort, we fit splines with knots which are
equally spaced in the interior of the interval $[\tau_{min}, \tau_{max}]$, where $\tau_{min}$ is the minimum
truncation age in the cohort and $\tau_{max}$ the maximum follow-up (failure or censoring)
time. Restrictions are placed on the coefficients to ensure that $\lambda_m(t)$ remains positive
for all $t$. Thus with $m$ knots the number of parameters is $m + 3$. Models can be
fit using maximum likelihood techniques applied to the corresponding conditional
likelihood, as discussed in our previous report.
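
As a minimal sketch of how a quadratic spline hazard with truncated-power terms and equally spaced interior knots can be evaluated, consider the following Python fragment. The function names and all coefficient values are hypothetical and chosen only for illustration; the positivity restrictions on the coefficients are not enforced here.

```python
def quadratic_spline_hazard(t, gamma0, gamma_knots, knots):
    """Evaluate sum_{i=0}^{2} gamma0[i] * t**i
    + sum_j gamma_knots[j] * max(t - knots[j], 0)**2."""
    if len(gamma0) != 3 or len(gamma_knots) != len(knots):
        raise ValueError("need 3 polynomial coefficients and one coefficient per knot")
    poly = gamma0[0] + gamma0[1] * t + gamma0[2] * t * t
    trunc = sum(g * max(t - tau, 0.0) ** 2 for g, tau in zip(gamma_knots, knots))
    return poly + trunc


def interior_knots(tau_min, tau_max, m):
    """Return m knots equally spaced in the interior of [tau_min, tau_max]."""
    step = (tau_max - tau_min) / (m + 1)
    return [tau_min + step * (j + 1) for j in range(m)]


# Hypothetical cohort: truncation age 30, maximum follow-up age 90, two knots.
knots = interior_knots(30.0, 90.0, 2)
coef_poly = [1e-4, 0.0, 1e-6]    # illustrative polynomial coefficients
coef_knot = [2e-6, -3e-6]        # illustrative knot coefficients
hazard_at_60 = quadratic_spline_hazard(60.0, coef_poly, coef_knot, knots)
```

With m = 2 knots the model has 3 + 2 = 5 free coefficients, in line with the m + 3 count given in the text.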

We have developed software designed to compute the spline estimates by
maximizing the likelihood function using the algorithm of Powell. We start with one
knot and increase the number of knots until the fit is not improved, as determined
by the likelihood ratio test at the significance level $\alpha = 0.05$. Three other subcohort
estimates of the hazard function were computed for comparison with the spline
estimator: an estimator of the life table type, a Gaussian kernel estimate based on the
Nelson-Aalen nonparametric estimator, and local likelihood estimators with
different kernels (uniform, Epanechnikov, and Gaussian). All the estimators mentioned
above are in good agreement with each other when applied to the UPDB data.
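
The stopping rule for the number of knots can be sketched as follows. This is an illustration, not the software developed for the project: the callback `max_loglik` is a hypothetical stand-in for a routine that returns the maximized conditional log-likelihood for a given number of knots, and the critical value assumes that each added knot contributes exactly one parameter, so that the likelihood ratio statistic is referred to a chi-square distribution with one degree of freedom.

```python
CHI2_1DF_CRIT_05 = 3.841458820694124  # 0.95 quantile of chi-square with 1 d.f.


def select_number_of_knots(max_loglik, max_knots=10, crit=CHI2_1DF_CRIT_05):
    """Start with one knot and add knots while the likelihood ratio
    statistic 2 * (l_{m+1} - l_m) exceeds the critical value."""
    m = 1
    current = max_loglik(m)
    while m < max_knots:
        candidate = max_loglik(m + 1)
        lr_stat = 2.0 * (candidate - current)
        if lr_stat <= crit:
            break                      # fit not significantly improved
        m, current = m + 1, candidate
    return m


# Hypothetical maximized log-likelihoods indexed by the number of knots.
fake_loglik = {1: -1000.0, 2: -990.0, 3: -989.5, 4: -980.0}
chosen = select_number_of_knots(lambda m: fake_loglik[m], max_knots=4)
```

In this toy example the second knot improves the fit (statistic 20) but the third does not (statistic 1), so the search stops at two knots even though a fourth knot would have helped; the rule is greedy by construction.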

Using the computer programs developed in Year 1, the hazard function for
cancer incidence has been estimated from left truncated and right censored data on
individuals identified through the UPDB and UCR.
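
To illustrate how left truncation enters the estimation, the following sketch computes the Nelson-Aalen cumulative hazard with a risk set that includes a subject only between her truncation (entry) age and her exit age. The function name and the toy data are assumptions made for illustration only.

```python
def nelson_aalen_left_truncated(entry, exit_, event):
    """Return [(t, H(t))] at the distinct event times.

    A subject is at risk at time t when entry < t <= exit, which is how
    left truncation and right censoring both enter the estimator."""
    times = sorted({x for x, d in zip(exit_, event) if d})
    cumulative, out = 0.0, []
    for t in times:
        at_risk = sum(1 for a, x in zip(entry, exit_) if a < t <= x)
        failures = sum(1 for x, d in zip(exit_, event) if d and x == t)
        if at_risk > 0:
            cumulative += failures / at_risk
        out.append((t, cumulative))
    return out


# Toy data: entry age, exit age, and event indicator (1 = diagnosis, 0 = censored).
entry_age = [40.0, 45.0, 50.0, 55.0]
exit_age = [60.0, 70.0, 65.0, 80.0]
diagnosed = [1, 0, 1, 1]
cum_hazard = nelson_aalen_left_truncated(entry_age, exit_age, diagnosed)
```

A kernel estimate of the hazard itself would then be obtained by smoothing the increments of this step function.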

Although  the  estimates  become  less  reliable  at  increasing  age,  the  hazard  function 
for  breast  cancer  appears  to  be  essentially  non-decreasing  in  all  the  categories  of  all 
familial measures considered. Thus we find no evidence of an "immune fraction" in
this  analysis.  The  curves  for  different  levels  of  risk  appear  not  to  merge  or  cross, 
indicating  that  the  increased  risk  to  those  with  a  family  history  does  not  dissipate 
after  a  certain  age. 

This  study  is  presented  at  length  in  the  paper  by  Boucher  and  Kerber  included 
in  Appendix  1. 

2.2.  Measures  of  Familial  Aggregation  as  Predictors  of  Breast 
Cancer  Risk 

Several measures of familial disease aggregation have been proposed, but only a few
of these are designed to be implemented at the individual level. We have evaluated
four of them in the context of breast cancer incidence. After extensive discussions,
we came to the conclusion that testing different measures of family history with
simulated data was not warranted, because such a study would have
added little to the results of real data analysis. Therefore, we decided to focus on a
more comprehensive analysis of epidemiological data employing a wider spectrum of
potential predictors of breast cancer risk.

A population-based cohort consisting of 114,429 women born between 1874 and
1931 and at risk for breast cancer after 1965 was identified by linking the UPDB
and the UCR. Three competing methods were used to obtain predictors of familial
aggregation of risk: the number of first degree relatives with breast cancer, the
posterior probability of carrying BRCA1 or BRCA2, and the Familial Standardized
Incidence Ratio (FSIR), which weights the disease status of relatives based on their
degree of relatedness with the proband. Spline regression methods were used to
estimate the hazard function, stratified by measures of familial aggregation.

We dichotomized each of our measures of familial risk, with the high risk category
representing approximately 8.5% of the data in each case. This was a natural cut
point, as it represents the proportion of subjects with one or more first degree
relatives with breast cancer. The cutoff for FSIR roughly corresponds to a relative risk
of two to family members. The cut points for the posterior probability of BRCA1
and BRCA2 come at points where the posterior probability is rather small, less than
0.0005 in both cases.

Our previous analysis indicated that a highly significant birth-year effect exists
in the data, with a woman born ten years later having an estimated 40% increased
age-specific risk. Birth-year was included as an additional covariate in all regression
analyses. The baseline risk was estimated using splines, with the proportional
hazards model used for birth-year and familial risk. As with most of the models, we
found that two knots were sufficient to provide an optimal fit.

The presence of a first degree relative with breast cancer and the dichotomized
FSIR variable each appear to be equally effective at distinguishing high risk
subjects, with the high risk category having about double the risk, while the posterior
probability of BRCA1 and BRCA2 appears to be less effective.

We performed a more detailed stratified analysis of FSIR. The category
boundaries were the approximate 75th, 90th, and 99.9th percentiles of the (adjusted) FSIR
distribution. The upper category roughly corresponds to the reported fraction of the
general population carrying known breast cancer genes. Bootstrap confidence bands
were computed as well as an indicator of the reliability of the estimates.

The estimates of the age-specific hazard and percentile-based bootstrap
confidence intervals are presented in Figure 1. The bootstrap confidence intervals are
based on 100 bootstrap samples, except for the < 75th percentile category, which
is based on 20 bootstrap samples, because of the extensive time it took to fit the
models to the large datasets.
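
A percentile-based bootstrap interval of the kind described above can be sketched as follows. In the actual analysis the statistic would be the fitted spline hazard at a given age, refit on every resample; here a sample mean on made-up data stands in for it, and the function name, seed, and data are all illustrative assumptions.

```python
import random


def percentile_bootstrap_ci(data, statistic, n_boot=100, level=0.95, seed=0):
    """Percentile bootstrap interval: resample with replacement, recompute
    the statistic, and read off the empirical quantiles of the replicates."""
    rng = random.Random(seed)
    n = len(data)
    replicates = sorted(
        statistic([data[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    lo_idx = int((1.0 - level) / 2.0 * (n_boot - 1))
    hi_idx = int(round((1.0 + level) / 2.0 * (n_boot - 1)))
    return replicates[lo_idx], replicates[hi_idx]


def sample_mean(xs):
    return sum(xs) / len(xs)


# Made-up ages at diagnosis, used only to exercise the function.
ages = [52.0, 61.0, 58.0, 70.0, 66.0, 49.0, 73.0, 64.0, 57.0, 68.0]
ci_low, ci_high = percentile_bootstrap_ci(ages, sample_mean, n_boot=100)
```

With only 20 replicates, as used for the largest stratum, the same code applies with `n_boot=20`, but the interval endpoints then coincide with the smallest and largest replicates and are correspondingly less stable.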

We incorporated the posterior probabilities of BRCA1 and BRCA2 and their
logarithms, as well as log log FSIR as continuous variables in separate analyses, using
a proportional hazards model with birth-year as an additional covariate. The best
result (in terms of statistical significance) was obtained by including the log log FSIR,
where we get a likelihood ratio $\chi^2_1 = 316.72$ ($p < 0.00001$).

We also considered the indicator variable NFIRST for presence/absence of a
first degree relative, in a proportional hazards model. The behavior of the hazard
function across different strata shows that the proportional hazards assumption is
not grossly violated. The variable NFIRST was highly significant (likelihood ratio
$\chi^2_1 = 185.6$, $p < 0.0001$). Addition of a second indicator variable for two or more
first degree relatives with breast cancer did not improve the likelihood significantly.
More technical details on this study are given in the paper by Boucher and Kerber
included in Appendix 2.


2.3.  Modeling  cancer  detection 

Let $T$ be the age at tumor onset, and $W$ the time of spontaneous tumor detection
measured from the onset of disease. Introduce the random variable (r.v.) $S$ to
represent tumor size at spontaneous detection. Then $S = f(W)$, where
$f : [0, \infty) \to [1, \infty)$ is a deterministic function describing the law of tumor growth.
It is assumed that

(1) random variables $T$ and $W$ are absolutely continuous and independent;

(2) function $f$ is differentiable and $f' > 0$;

(3) the rate of spontaneous tumor detection is proportional to the current tumor size
with coefficient $\alpha > 0$.

We observe sample values of the random vector $Y := (T + W, S)$, whose
components are interpreted as age and tumor size at spontaneous detection, respectively.
We look at $Y$ as a transformation of the random vector $X := (T, W)$, $Y = \varphi(X)$,
where $\varphi(t, w) = (t + w, f(w))$, $t, w \ge 0$. Observe that components of $X$ are
independent random variables. The inverse function $\psi = \varphi^{-1} : A \to R_+^2$, where
$A := \{(u, v) \in R_+^2 : 1 \le v \le f(u)\}$, is given by $\psi(u, v) = (u - g(v), g(v))$, with
$g := f^{-1}$. Note that the Jacobian of $\psi$ is $g'$. Then for the probability density function
(p.d.f.) of $Y$ we have, assuming that $(u, v) \in A$:

$$p_Y(u, v) = p_X(\psi(u, v))\, g'(v) = p_T(u - g(v))\, p_W(g(v))\, g'(v) = p_T(u - g(v))\, p_S(v).$$

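In the particular case of exponential tumor growth with rate $\lambda > 0$ ($f(w) = e^{\lambda w}$)
we have $g(v) = \ln v / \lambda$, and assumption (3) gives
$p_S(v) = (\alpha/\lambda) \exp\{-(\alpha/\lambda)(v - 1)\}$, $v \ge 1$, so that we obtain

$$p_Y(u, v) = \frac{\alpha}{\lambda}\, e^{-\frac{\alpha}{\lambda}(v - 1)}\, p_T\!\left(u - \frac{\ln v}{\lambda}\right), \quad u \ge 0, \; 1 \le v \le e^{\lambda u}. \qquad (2)$$

Thus, the distribution of random vector $Y$ is absolutely continuous but the support of
$Y$ depends on the unknown parameter $\lambda$. As far as the asymptotic likelihood inference
is concerned, the usual regularity conditions are not met for the distribution $p_Y(u, v)$.
However, experience with similar parametric settings suggests that the estimation
efficiency for the parameter $\lambda$ may be expected to be even higher than in the regular
case although asymptotic normality may fail.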
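
The joint density (2) is straightforward to evaluate once a density for the age at onset is supplied. The sketch below assumes exponential growth as in the text; the onset density `toy_p_T` and the numerical values are hypothetical placeholders, not quantities from this study.

```python
import math


def joint_density(u, v, alpha, lam, p_T):
    """Density of (age, tumor size) at spontaneous detection under
    f(w) = exp(lam * w): (alpha/lam) * exp(-(alpha/lam)(v-1)) * p_T(u - ln(v)/lam).

    The support is u >= 0, 1 <= v <= exp(lam * u); the upper bound is
    checked on the log scale to avoid overflow."""
    if u < 0.0 or v < 1.0 or math.log(v) > lam * u:
        return 0.0
    theta = alpha / lam
    return theta * math.exp(-theta * (v - 1.0)) * p_T(u - math.log(v) / lam)


def toy_p_T(t):
    # Hypothetical onset density (exponential with rate 0.02), for illustration only.
    return 0.02 * math.exp(-0.02 * t) if t >= 0.0 else 0.0


inside = joint_density(50.0, 1.0, alpha=2.0, lam=4.0, p_T=toy_p_T)
outside = joint_density(0.1, 10.0, alpha=2.0, lam=4.0, p_T=toy_p_T)  # ln(10) > 0.4
```

The second call returns zero because the point lies outside the parameter-dependent support, which is the feature that makes the likelihood non-regular in $\lambda$.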

Let $\{(u_i, v_i) : 1 \le i \le n\}$ be sample data on age and tumor size at detection.

The structure of the joint distribution (2) suggests the following maximum likelihood
procedure for estimation of the parameters $\theta$ and $\lambda$:

(1) Denote $\theta = \alpha/\lambda$ in formula (2), and find the maximum likelihood estimate, $\hat\theta$,
of the parameter $\theta$ using only the tumor size data $\{v_i : 1 \le i \le n\}$. It follows
(see below) that the sample $\{v_i\}$ is drawn from an exponential distribution with
parameter $\theta$ (shifted to the half-line $v \ge 1$), and consequently

$$\hat\theta = \frac{n}{\sum_{i=1}^{n} (v_i - 1)}.$$

(2) Maximize the function

$$\prod_{i=1}^{n} p_T\!\left(u_i - \frac{\ln v_i}{\lambda}\right), \quad u_i \ge 0, \; v_i \ge 1,$$

or its logarithm, to find the estimate of $\lambda$ denoted by $\hat\lambda$.

(3) The maximum likelihood estimate of $\alpha$ is given by $\hat\alpha = \hat\theta \hat\lambda$.
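
The three steps can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the procedure as implemented for the report: the one-dimensional maximization is done by a crude grid search, the upper search limit `lam_hi` must be supplied by the user, and the onset log-density `toy_log_p_T` is a hypothetical, unnormalized placeholder.

```python
import math


def fit_theta_lambda_alpha(u, v, log_p_T, lam_hi, grid=2000):
    """Step (1): theta from tumor sizes only; step (2): lambda by a
    one-dimensional search; step (3): alpha = theta * lambda."""
    n = len(v)
    theta_hat = n / sum(vi - 1.0 for vi in v)
    # The constraint u_i - ln(v_i)/lambda >= 0 gives a lower bound for lambda.
    lam_lo = max(math.log(vi) / ui for ui, vi in zip(u, v))

    def loglik(lam):
        return sum(log_p_T(ui - math.log(vi) / lam) for ui, vi in zip(u, v))

    candidates = (lam_lo + (lam_hi - lam_lo) * k / grid for k in range(grid + 1))
    lam_hat = max(candidates, key=loglik)
    return theta_hat, lam_hat, theta_hat * lam_hat


def toy_log_p_T(t):
    # Hypothetical unnormalized log-density of onset age, peaked at 40 years.
    return float("-inf") if t < 0.0 else -((t - 40.0) ** 2) / 200.0


u_obs = [50.0, 60.0, 70.0]   # made-up ages at detection
v_obs = [2.0, 3.0, 4.0]      # made-up tumor sizes at detection
theta_hat, lam_hat, alpha_hat = fit_theta_lambda_alpha(u_obs, v_obs, toy_log_p_T, lam_hi=1.0)
```

Because the likelihood in $\lambda$ is defined only above the data-dependent bound, a derivative-free search such as the grid used here (or a direction-set method) is a natural choice.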

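The above procedure does the same job as maximizing the likelihood function
based on the joint distribution (2). To show this, let the joint density of the random
variables $U$ and $V$ be of the form

$$p(u, v \mid \lambda, \theta) = f\!\left(u - \frac{\varphi(v)}{\lambda}\right) g(v; \theta),$$

where $u \ge 0$, $v \ge 1$, $\lambda > 0$. It is assumed that $f(t) > 0$ for $t > 0$, $f(t) = 0$ for $t < 0$,
$g(x; \theta) > 0$ for $x \ge 1$, and $\varphi : [1, \infty) \to (0, \infty)$ is a measurable function. Suppose that
there exists a unique maximizer $(\hat\lambda, \hat\theta)$, $\hat\lambda > 0$, for the likelihood function

$$L(\lambda, \theta) = \prod_{i=1}^{n} f\!\left(u_i - \frac{\varphi(v_i)}{\lambda}\right) g(v_i; \theta).$$

Then $u_i - \varphi(v_i)/\hat\lambda \ge 0$ for all $i$, whence

$$\hat\lambda \ge \max_{1 \le i \le n} \frac{\varphi(v_i)}{u_i} > 0.$$

Given $\hat\lambda > 0$, it is clear that $\hat\lambda$ and $\hat\theta$ are unique maximizers for the functions

$$\prod_{i=1}^{n} f\!\left(u_i - \frac{\varphi(v_i)}{\lambda}\right) \quad \text{and} \quad \prod_{i=1}^{n} g(v_i; \theta),$$

respectively. Conversely, if $\hat\lambda > 0$ and $\hat\theta$ are unique maximizers for these functions,
the pair $(\hat\lambda, \hat\theta)$ is a unique maximizer for the likelihood function $L(\lambda, \theta)$. Finally,
observe that $g(v; \theta)$ is the marginal density of the random variable $V$. Indeed, we
have

$$\int_0^\infty f\!\left(u - \frac{\varphi(v)}{\lambda}\right) g(v; \theta)\, du = g(v; \theta) \int_0^\infty f(t)\, dt = g(v; \theta).$$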

The performance of the above described estimation procedure was studied by
computer simulations. A total of 50 pseudo-random samples of $(u_i, v_i)$ were generated
from the joint distribution (2); each sample contained $n = 100$ realizations of the
random vector $(U, V)$. We used the composition method to simulate samples of
pairs $(u_i, v_i)$. In accordance with this method, we first draw $v_i$ from the marginal
distribution of the random variable $V$, and then generate $u_i$ from the distribution of
$U$ conditional on $V = v_i$. The p.d.f. $p_T(x)$ was specified by the Moolgavkar-Venzon-Knudson
model of carcinogenesis with the survival function given by the following
formula:

$$G_T(t) := \Pr(T > t) = \left[\frac{(A + B)\, e^{At}}{B + A\, e^{(A+B)t}}\right]^{\delta}, \quad t \ge 0,$$

where $A, B, \delta > 0$ are identifiable parameters of the model. We used the following
values of model parameters: $\alpha = 2.3 \times 10^{-10}$, $\lambda = 6.9$, $A$ = [value illegible in source],
$B = 0.1821$, $\delta = 0.0364$.
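
The composition method can be sketched as follows, assuming the Moolgavkar-Venzon-Knudson survival function has the form $G_T(t) = [(A + B)e^{At} / (B + A e^{(A+B)t})]^{\delta}$ and that tumor growth is exponential with rate $\lambda$, so that tumor size minus one is exponential with parameter $\alpha/\lambda$ and $U$ given $V = v$ is distributed as $T + \ln v/\lambda$. The onset age is drawn by inverting $G_T$ with bisection, which is one possible choice among several. All parameter values in the example are illustrative; in particular the value of `A` is a placeholder and is not taken from the report.

```python
import math
import random


def mvk_survival(t, A, B, delta):
    """G_T(t) = [ (A+B) e^{At} / (B + A e^{(A+B)t}) ]^delta, computed on the
    log scale so that large t does not overflow."""
    x = (A + B) * t
    log_den = x + math.log(A + B * math.exp(-x))   # log(B + A e^x)
    return math.exp(delta * (math.log(A + B) + A * t - log_den))


def sample_onset_age(rng, A, B, delta, t_max=1.0e4):
    """Inverse-transform sampling: solve G_T(t) = r by bisection on [0, t_max]."""
    r = rng.random()
    lo, hi = 0.0, t_max
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mvk_survival(mid, A, B, delta) > r:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def simulate_pair(rng, alpha, lam, A, B, delta):
    """Draw v from its marginal, then u from its conditional law given v."""
    theta = alpha / lam
    v = 1.0 + rng.expovariate(theta)          # V - 1 is exponential with rate theta
    t = sample_onset_age(rng, A, B, delta)
    u = t + math.log(v) / lam                 # age at detection = onset + growth time
    return u, v


rng = random.Random(1)
params = dict(alpha=2.3e-10, lam=6.9, A=1.0e-4, B=0.1821, delta=0.0364)  # A is hypothetical
sample = [simulate_pair(rng, **params) for _ in range(100)]
```

Bisection is slow but robust; a closed-form inverse or a rejection scheme could replace it without changing the structure of the sampler.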

Simulation Experiment 1. In this experiment, we kept the parameters $A$, $B$, and $\delta$ at
their true values and applied the estimation procedure to simulated data in order to
obtain estimates of the parameters $\lambda$ and $\alpha$. In this case, the likelihood function can
be maximized by a unidimensional search for $\lambda$ with a fixed value of $\theta$. The estimates
of $\lambda$ and $\alpha$ resulting from each of the 50 samples were summarized by calculating their
sample means $\bar\lambda$ and $\bar\alpha$, as well as the corresponding standard errors (of the sample
mean) denoted by $\sigma_\lambda$ and $\sigma_\alpha$, respectively. We obtained the following numerical
values: $\bar\lambda = 7.45$, $\sigma_\lambda = 0.9$, $\bar\alpha = 2.53 \times 10^{-10}$, $\sigma_\alpha = 0.34 \times 10^{-10}$. These results
indicate that, provided the parameters $A$, $B$ and $\delta$ are known, the estimation procedure
performs well when applied to finite samples.

Simulation Experiment 2. Proceeding from the same true parameter values, the
estimation procedure was applied to simulated data to obtain estimates of all the
parameters incorporated into the model. Since there were three additional
parameters to be estimated from simulated data, the size of each sample was increased
up to 1000. The results were summarized in just the same way as in Experiment
1 to give: $\bar\lambda = 9.4$, $\sigma_\lambda = 0.9$, $\bar\alpha = 3.1 \times 10^{-10}$, $\sigma_\alpha = 3.1 \times 10^{-11}$,
$\bar A = 9.5 \times 10^{[\text{exponent illegible}]}$, $\sigma_A = 3.6 \times 10^{[\text{exponent illegible}]}$,
$\bar B = 0.1407$, $\sigma_B = 0.0599$, $\bar\delta = 0.0507$, $\sigma_\delta = 0.006$.

Simulation Experiment 3. The estimation procedure was applied to a single sample
of size 50,000 generated from the joint distribution (2). The estimated parameter
values were: $\hat\lambda = 6.7$, $\hat\alpha = 2.24 \times 10^{-10}$, $\hat A = 5.1 \times 10^{[\text{exponent illegible}]}$,
$\hat B = 0.1390$, $\hat\delta = 0.0475$.

The  above  simulation  experiments  show  that  estimation  of  the  whole  set  of  model 
parameters  is  feasible  given  the  model  is  adequate  for  the  processes  under  study,  but 
obtaining  unbiased  estimates  would  require  large  sample  sizes. 

Suppose now that the process of tumor growth is described by the exponential law
$f(w) = e^{\lambda w}$, $w \ge 0$, with a random growth rate $\lambda$. We also assume that the random
parameter $\theta := \alpha/\lambda$ is gamma distributed with parameters $a$ and $b$. Compounding
(2) with respect to the gamma distribution of the parameter $\theta$ we find the p.d.f. of
the resulting randomized distribution of the vector $Y$:

$$p(u, v) = \frac{b^a}{\Gamma(a)} \int_0^{\alpha u/\ln v} t^a\, e^{-(b + v - 1)t}\, p_T\!\left(u - \frac{\ln v}{\alpha}\, t\right) dt, \quad u \ge 0, \; v \ge 1.$$

Setting $s := u - (\ln v/\alpha)\, t$ we rewrite the last formula in an equivalent form

$$p(u, v) = \frac{b^a}{\Gamma(a)} \left(\frac{\alpha}{\ln v}\right)^{a+1} \int_0^u (u - s)^a \exp\left\{-\frac{\alpha}{\ln v}(b + v - 1)(u - s)\right\} p_T(s)\, ds, \qquad (3)$$

for $u \ge 0$, $v \ge 1$. Alternatively, we may assume that it is the parameter $1/\lambda$ that is
gamma distributed with parameters $a$ and $b$. Should this be the case, we have

$$p(u, v) = \frac{\alpha b^a}{\Gamma(a)} \int_0^{u/\ln v} t^a \exp\{-[b + \alpha(v - 1)]t\}\, p_T(u - t \ln v)\, dt$$

$$\phantom{p(u, v)} = \frac{\alpha b^a}{(\ln v)^{a+1}\, \Gamma(a)} \int_0^u (u - s)^a \exp\left\{-\frac{b + \alpha(v - 1)}{\ln v}(u - s)\right\} p_T(s)\, ds, \qquad (4)$$

for $u \ge 0$, $v \ge 1$.

Once the density $p_T$ of the age at tumor onset $T$ is specified within a certain
parametric family, equations (3) or (4) allow us to compute the p.d.f. of the joint
distribution of age and tumor size at detection. Observe that in this randomized version
the support $[0, \infty) \times [1, \infty)$ of the distribution of random vector $Y$ is parameter
free. The maximum likelihood parametric inference based on the joint p.d.f. $p(u, v)$
accommodates censored observations under the usual independent censorship model.
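
As a numerical illustration, the randomized density for the case where $1/\lambda$ is gamma distributed (shape $a$, rate $b$) can be computed by simple quadrature in either of its two integral forms: over $t \in (0, u/\ln v)$ with integrand $t^a \exp\{-[b + \alpha(v-1)]t\}\, p_T(u - t \ln v)$, or over $s \in (0, u)$ after the substitution $s = u - t \ln v$. The sketch below uses a midpoint rule and a hypothetical onset density; it assumes $v > 1$ so that $\ln v > 0$, and none of the numerical values come from the report.

```python
import math


def density_t_form(u, v, alpha, a, b, p_T, n=4000):
    """alpha * b^a / Gamma(a) * integral over t in (0, u/ln v)."""
    log_v = math.log(v)
    h = (u / log_v) / n
    k = b + alpha * (v - 1.0)
    total = 0.0
    for i in range(n):
        t = (i + 0.5) * h
        total += t ** a * math.exp(-k * t) * p_T(u - t * log_v)
    return alpha * b ** a / math.gamma(a) * total * h


def density_s_form(u, v, alpha, a, b, p_T, n=4000):
    """alpha * b^a / ((ln v)^(a+1) Gamma(a)) * integral over s in (0, u)."""
    log_v = math.log(v)
    h = u / n
    k = b + alpha * (v - 1.0)
    total = 0.0
    for i in range(n):
        s = (i + 0.5) * h
        total += (u - s) ** a * math.exp(-k * (u - s) / log_v) * p_T(s)
    return alpha * b ** a / (log_v ** (a + 1.0) * math.gamma(a)) * total * h


def toy_onset_density(t):
    # Hypothetical onset density (exponential with rate 0.02), for illustration only.
    return 0.02 * math.exp(-0.02 * t) if t >= 0.0 else 0.0


val_t = density_t_form(60.0, 5.0, alpha=0.5, a=2.0, b=3.0, p_T=toy_onset_density)
val_s = density_s_form(60.0, 5.0, alpha=0.5, a=2.0, b=3.0, p_T=toy_onset_density)
```

The two forms agree up to rounding, which gives a quick check on the change of variables; the $s$-form has the practical advantage that its integration limits do not depend on $v$.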

2.4.  Future  Plans 

Formulas  (2)  and  (3)  will  be  used  to  estimate  the  natural  history  of  breast  cancer 
from the UPDB data. This will allow us to find a parametric estimate of the p.d.f.
$p_{T+W}(t)$, which is necessary for designing optimal schedules of breast cancer screening
allowing for information on family history.


3.  Key  Research  Accomplishments 

Our  key  accomplishments  in  Year  2  can  be  summarized  briefly  as  follows: 

•  We  have  used  computer  programs  developed  in  Year  1  to  estimate  the  hazard 
function from data on breast cancer amassed in the UPDB and UCR. These
nonparametric estimates are in good agreement with predictions based on the proposed
mechanistic  model  of  cancer  development  and  detection. 

•  We  have  tested  several  aggregated  measures  of  family  history  as  predictors  of 
breast  cancer  risk.  This  study  points  the  way  for  data  stratification  required  for 
construction  of  individualized  strategies  of  breast  cancer  surveillance. 

•  We  have  derived  the  joint  distribution  of  tumor  size  and  age  at  detection  and 
its  randomized  counterpart  which  are  necessary  for  estimation  of  the  natural  history 
of  the  disease.  Simulation  experiments  have  been  conducted  to  evaluate  how  well 
unknown  parameters  incorporated  into  the  distribution  can  be  estimated  by  the 
maximum  likelihood  method  from  available  bivariate  data  on  tumor  size  and  age  at 
diagnosis  of  breast  cancer. 


4.  Reportable  Outcomes 

4.1.  New  Publications 

1.  Yakovlev,  A.Y.,  Tsodikov,  A.D.,  and  Hanin,  L.G.  Optimal  schedules  of  breast 
cancer  surveillance.  Abstract,  Era  of  Hope  Meeting,  Atlanta,  June  2000. 

2.  Boucher,  K.M.  and  Kerber,  R.A.  The  shape  of  the  hazard  function  for  cancer 
incidence.  Abstract,  Era  of  Hope  Meeting,  Atlanta,  June  2000. 

3. Boucher, K.M. and Kerber, R.A. The Shape of the Hazard Function for Cancer
Incidence, Mathematical and Computer Modelling, to appear.

4. Boucher, K.M. and Kerber, R.A. Measures of Familial Aggregation as Predictors
of Breast Cancer Risk, Journal of Epidemiology and Biostatistics, under revision.

4.2.  Awards 

1. Grant 1 U01 CA88177-01, NIH/NCI, Mechanistic Modeling of Breast Cancer
Surveillance, RFA "Cancer Intervention and Surveillance Network (CISNET)", P.I.:
Yakovlev, A.Y., 09/01/00 - 08/31/04, total costs: $537,653.

5.  Conclusions 

The  results  of  data  analysis  are  consistent  with  an  increasing  hazard  for  breast  cancer 
incidence  through  age  85  or  90.  The  hazard  function  appears  to  be  higher  for  more 
recent  birth  cohorts.  The  shape  of  the  hazard  function  appears  to  be  consistent 
with  a  two-stage  model  for  spontaneous  carcinogenesis  in  which  the  initiation  rate  is 
constant  or  increasing. 

We have applied several methods of measuring familial aggregation at the
individual level to breast cancer data. All prove to be significant predictors of individual
risk. Judging by the difference in risk estimates, as well as the likelihood ratio test,
presence of a first degree relative and FSIR appear to be better indicators of
increased risk than the posterior probability of BRCA1 or BRCA2. Judging solely
by the likelihood ratio test, one would prefer FSIR. The latter indicator may be
thought  of  as  an  extension  of  the  cruder  number  of  first  degree  relatives  with  breast 
cancer,  adjusting  for  the  level  of  relatedness  and  expected  disease.  It  is  therefore  not 
surprising  to  find  that  it  performs  better. 

Marginal  distributions  of  tumor  size  and  age  at  detection  as  well  as  associated 
estimation  problems  were  discussed  in  our  previous  report.  Now  we  have  derived 
the  joint  distribution  of  these  two  random  variables  and  its  randomized  counterpart. 
Generally  speaking,  explicit  formulas  for  the  marginal  distributions  of  tumor  size 
and  age  of  an  individual  at  detection  are  not  sufficient  to  utilize  completely  the 
information  contained  in  the  corresponding  sample  observations  for  estimation  of 
the  natural  history  of  the  disease;  one  needs  to  know  their  joint  distribution  in  order 
to  develop  pertinent  methods  for  the  maximum  likelihood  statistical  inference. 


6.  So  what? 

1. As evidenced by the results of data analysis, the shape of the hazard function for
breast cancer incidence is consistent with predictions based on the proposed
mechanistic model of cancer development and detection.

2. We now know how the data should be stratified with respect to aggregated
characteristics of family history in order to construct individualized optimal strategies of
breast cancer screening.

3. In Year 3, our focus will be on the development of methods for parametric
estimation of the natural history of breast cancer based on formulas (3) and (4) from the
UPDB data stratified with respect to individual information on family history. This
study will produce estimates to be used for designing optimal schedules of breast
cancer screening.




Appendix  1 




The  Shape  of  the  Hazard  Function  for 
Cancer  Incidence 


Kenneth  M.  Boucher  and  Richard  A.  Kerber 


Huntsman  Cancer  Institute  and  Department  of  Oncological  Sciences,  University  of 
Utah,  2000  East  North  Campus  Drive,  Salt  Lake  City,  Utah  84112 


Running title: Hazard Function for Incidence


Corresponding  author: 

Kenneth  M.  Boucher,  Huntsman  Cancer  Institute  and  Department  of  Oncological 
Sciences,  University  of  Utah,  2000  Circle  of  Hope  Dr.,  Salt  Lake  City,  UT  84112, 
U.S.A. 


Phone:  801-585-9544,  FAX:  801-585-5357 
e-mail:  ken.boucher@hci.utah.edu 
7/00 




ABSTRACT 


A population-based cohort consisting of 126,141 men and 122,208 women born
between 1874 and 1931 and at risk for breast or colorectal cancer after 1965 was
identified by linking the Utah Population Data Base and the Utah Cancer Registry.
The hazard function for cancer incidence is estimated from left truncated and right
censored data based on the conditional likelihood. Four estimation procedures
based on the conditional likelihood are used to estimate the age-specific hazard
function from the data: the life-table method, a kernel method based on the
Nelson-Aalen estimator, a spline estimate, and a proportional hazards estimate based
on splines with birth year as sole covariate.

The  results  are  consistent  with  an  increasing  hazard  for  both  breast  and  colorectal 
cancer  through  age  85  or  90.  After  age  85  or  90  the  hazard  function  for  female  breast 
and  colorectal  cancer  may  reach  a  plateau  or  decrease,  although  the  hazard  function 
for  male  colorectal  cancer  appears  to  continue  to  rise  through  age  105.  The  hazard 
function  for  both  breast  and  colorectal  cancer  appears  to  be  higher  for  more  recent 
birth  cohorts,  with  a  more  pronounced  birth-cohort  effect  for  breast  cancer  than  for 
colorectal  cancer.  The  age  specific  hazard  for  colorectal  cancer  appears  to  be  higher 
for men than for women. The shape of the hazard function for both breast and
colorectal cancer appears to be consistent with a two-stage model for spontaneous
carcinogenesis in which the initiation rate is constant or increasing. Inheritance of
initiated cells appears to play a minor role.

KEYWORDS: hazard function, truncation, survival analysis, breast cancer,
colorectal cancer

1. Introduction


The shape of the hazard function may lead to insights into the biology of
carcinogenesis which may not be easily discernible from a study of the survival function alone.
For example, it is typical in the analysis of tumor recurrence data to find a hazard
function that is bimodal or unimodal, and that tends to zero as time tends to infinity
[1]. The modes of the hazard may be interpreted biologically as arising from two
different types of failure, one that tends to occur earlier and one that tends to occur
later. The decrease in the hazard function to zero may lead one to conclude that
there is a non-zero cured fraction. In fact, if we let $\lambda(t)$ denote the hazard function,
and $p$ the probability of cure, it follows from the formula

$$p = \lim_{t \to \infty} \exp\left[-\int_0^t \lambda(u)\, du\right],$$

that there are individuals who have been "cured" in the population exactly when
the hazard function has finite integral. In particular, $\lim_{t \to \infty} \lambda(t) = 0$, provided the
limit exists.
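
The link between a finite cumulative hazard and a non-zero cured fraction can be checked numerically. The sketch below approximates $p = \exp[-\int_0^{t_{max}} \lambda(u)\, du]$ by a midpoint rule for two hypothetical hazard functions; neither hazard is taken from the data, and `t_max` simply stands in for "large t".

```python
import math


def cure_fraction(hazard, t_max, n=20000):
    """Approximate exp(-integral_0^{t_max} hazard(u) du) with the midpoint rule."""
    h = t_max / n
    integral = sum(hazard((i + 0.5) * h) for i in range(n)) * h
    return math.exp(-integral)


# Hazard decaying to zero with finite integral 0.5 / 0.25 = 2: cured fraction exp(-2).
p_cured = cure_fraction(lambda t: 0.5 * math.exp(-0.25 * t), t_max=200.0)

# Constant hazard: the integral grows without bound, so the fraction shrinks toward 0.
p_constant = cure_fraction(lambda t: 0.05, t_max=200.0)
```

For the decaying hazard the result settles near $e^{-2} \approx 0.135$ however far `t_max` is pushed, whereas for the constant hazard it equals $e^{-0.05\, t_{max}}$ and keeps falling.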

If the hazard function under study is from disease incidence, the "cured fraction"
must be re-interpreted as the fraction of the population that is "immune" to the
disease. If the cumulative hazard appears to be bounded, for example, one should
expect  the  existence  of  a  non-zero  immune  fraction.  More  generally,  a  large  degree 
of  heterogeneity  in  disease  susceptibility  may  lead  to  a  population  hazard  function 
with  one  or  more  well-defined  maxima.  The  maxima  may  correspond  to  discrete 
subpopulations  with  different  genetic  predisposition  to  disease.  A  maximum  may 
also  result  from  a  continuous  frailty,  as  the  surviving  population  at  higher  ages  may 
be  overrepresented  by  individuals  with  lower  risk  [2]. 

Both breast and colorectal cancer are syndromes in which an inherited
susceptibility has been shown to play a role. Inherited mutations in p53, BRCA1, BRCA2,
the ataxia-telangiectasia gene (AT), HRAS, and the androgen receptor gene (AR)
have been shown to play a role in breast cancer susceptibility [3]. About 56% of
carriers of the mutation BRCA1 or BRCA2 will get breast cancer by the age of 70
years [4]. BRCA1 has an estimated allele frequency of between 0.0002 and 0.001
(95% CI) [5], and accounts for about 3% of diagnosed breast cancer [6]. The allele
frequency of mutations in BRCA2 is estimated at 0.00022 [7]. Germline mutations
in p53 and AR are extremely rare, and mutations in the HRAS1 minisatellite locus
which confer increased risk of breast cancer are also rare, having an estimated
population frequency of 6% [3]. In a study of 100 Finnish breast cancer families analyzed
by protein truncation tests and direct sequencing, Vehmanen et al. [8] found that
only 21% of breast cancer families were accounted for by mutations of BRCA1 and
BRCA2, providing indirect evidence for the existence of other, undiscovered breast
cancer genes.

Indirect evidence also exists for the existence of additional colorectal cancer genes.
Inherited mutations in the adenomatous polyposis coli (APC) gene and the hereditary
non-polyposis colon cancer syndrome (HNPCC) genes hMSH2 and hMLH1 have been shown to
play a role in colon cancer susceptibility [3]. After segregation analysis of 203
pedigrees, Houlston et al. [9] concluded that dominant colorectal cancer genes with a
frequency of 0.006 account for an estimated 81% of colorectal cancers in patients
under 35, 59% in patients between 35 and 49, decreasing to 16% in patients over
65. The I1307K mutation of the APC gene, found in Ashkenazi Jews, confers an
estimated relative risk of 1.7 for colorectal cancer (95% CI 1.01-2.87) [10]. APC and
HNPCC are rare, and contribute to a small percentage of colorectal cancer cases [3].

Additional insight can be gleaned from the hazard function for cancer incidence in
the framework of a mechanistic model of carcinogenesis. The most widely accepted
model is the Moolgavkar-Venzon-Knudson two-stage clonal expansion model [11,12].
The Moolgavkar-Venzon-Knudson model has the following assumptions:

(A) Normal, susceptible target cells are initiated according to a (nonhomogeneous)
Poisson process with intensity $\nu(t)$.

(B) The expansion of the colony of initiated cells and malignant transformation is
specified by a stochastic birth-death-migration process with division, death (or
differentiation) and transformation. Premalignant cells either divide into two
premalignant cells with rate $\alpha(t)$, die with rate $\beta(t)$, or divide asymmetrically into one
premalignant cell and one malignant cell with rate $\mu(t)$.

It has been shown that the hazard function for the Moolgavkar-Venzon-Knudson
model with constant parameters increases monotonically and approaches an
asymptote [13]. An asymptotic value for the hazard is also reached for the
Moolgavkar-Venzon-Knudson model with piecewise constant parameters, and in that case the
value of the asymptote depends only on the value of the coefficients in the unbounded
interval [13,14].

Expressions for the survivor function were first obtained by Moolgavkar and
Luebeck [13]. A simple explicit formula for the survivor function $S(t)$ for the
Moolgavkar-Venzon-Knudson model with constant parameters was obtained by Kopp-Schneider
et al. [15] and Zheng [16]:

$$S(t) = \left[ \frac{2c \exp\{(-\alpha + \beta + \mu - c)t/2\}}{(-\alpha + \beta + \mu + c) + (\alpha - \beta - \mu + c)e^{-ct}} \right]^{\nu/\alpha}, \tag{1}$$

where $c = \sqrt{(\alpha + \beta + \mu)^2 - 4\alpha\beta}$. Zheng also presented an expression for the
probability generating function for the number of malignant cells given a single malignant
cell at time $t = 0$, allowing an expression for the promotion time distribution

$$F(t) = \frac{(\alpha - \beta - \mu + c)(\alpha - \beta - \mu - c)e^{-ct} + (\alpha - \beta - \mu + c)(-\alpha + \beta + \mu + c)}{2\alpha[(\alpha - \beta - \mu + c)e^{-ct} + (-\alpha + \beta + \mu + c)]} \tag{2}$$

to be given. It is easy to see that $S(t)$ and $F(t)$ above are related by the formula

$$S(t) = \exp\left\{ -\nu \int_0^t F(x)\,dx \right\}, \tag{3}$$

which was shown by Hanin and Yakovlev [17] to be valid in a more general setting.
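
As a numerical check on formulas (1)-(3), the following Python sketch evaluates $S(t)$ and $F(t)$ for constant parameters and compares $S(t)$ with $\exp\{-\nu \int_0^t F(x)\,dx\}$ computed by Simpson's rule. The parameter values and function names are hypothetical and serve only as an illustration; they are not estimates from the Utah data.

```python
import math


def mvk_c(alpha, beta, mu):
    """c = sqrt((alpha + beta + mu)^2 - 4*alpha*beta)."""
    return math.sqrt((alpha + beta + mu) ** 2 - 4.0 * alpha * beta)


def mvk_survival(t, nu, alpha, beta, mu):
    """Survivor function S(t) of formula (1), constant parameters."""
    c = mvk_c(alpha, beta, mu)
    a = alpha - beta - mu
    numerator = 2.0 * c * math.exp(-(a + c) * t / 2.0)
    denominator = (c - a) + (c + a) * math.exp(-c * t)
    return (numerator / denominator) ** (nu / alpha)


def mvk_promotion_cdf(t, alpha, beta, mu):
    """Promotion time distribution F(t) of formula (2)."""
    c = mvk_c(alpha, beta, mu)
    a = alpha - beta - mu
    decay = math.exp(-c * t)
    numerator = (a + c) * (a - c) * decay + (a + c) * (c - a)
    denominator = 2.0 * alpha * ((a + c) * decay + (c - a))
    return numerator / denominator


def simpson(f, lower, upper, n=2000):
    """Composite Simpson rule with an even number n of subintervals."""
    h = (upper - lower) / n
    total = f(lower) + f(upper)
    for k in range(1, n):
        total += (4.0 if k % 2 else 2.0) * f(lower + k * h)
    return total * h / 3.0


# Hypothetical parameter values (arbitrary time unit), for illustration only.
NU, ALPHA, BETA, MU = 0.1, 1.0, 0.9, 1e-3

t = 10.0
direct = mvk_survival(t, NU, ALPHA, BETA, MU)
integral = simpson(lambda x: mvk_promotion_cdf(x, ALPHA, BETA, MU), 0.0, t)
via_relation_3 = math.exp(-NU * integral)
print(direct, via_relation_3)  # the two values should agree closely
```

The same functions show the asymptote mentioned above: for large $t$, $F(t)$ tends to $(\alpha - \beta - \mu + c)/(2\alpha)$, so the hazard $\nu F(t)$ levels off at $\nu$ times that value.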

Yakovlev and Tsodikov [18] replace assumption (B) above with the following
assumption:

(C) Progenitor cells are transformed into malignant lesions at a random time with
cumulative distribution function $F(x)$. All progenitor cells are promoted independently
of one another.

Assuming $F(0) = 0$, it follows that the process of malignant transformation is
also a Poisson process, with integral rate $\Lambda(t) = \int_0^t \nu(u) F(t - u)\,du$. As in the
Moolgavkar-Venzon-Knudson model, the simplest model of spontaneous carcinogenesis
takes $\nu(t) = \nu$ to be constant, in which case $\Lambda(t) = \nu \int_0^t F(u)\,du$ and the hazard
function for time-to-tumor, given by $\lambda(t) = \nu F(t)$, is nondecreasing. The probability
$S(t)$ that there are no malignancies by time $t$ is then given by (3).

This model may easily be modified to handle inherited lesions, via the limiting
case where $\nu(t)$ is taken to be a delta function of mass $\nu$ at the origin. If $F(t)$ is assumed to be
absolutely continuous, then the integral rate $\Lambda(t)$ is equal to $\nu F(t)$ and the hazard
function is $\lambda(t) = \nu F'(t) = \nu f(t)$, where $f(t)$ is the density function associated with $F(t)$.
We see that the hazard functions for spontaneous and inherited lesions are quite likely
to have very different shapes.
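
The contrast in shape can be made concrete with a small Python sketch. It assumes, purely for illustration, that the promotion time follows an Erlang (integer-shape gamma) distribution; the distribution and the values of `NU`, `SHAPE`, and `RATE` are hypothetical choices, not quantities from this report. The spontaneous hazard is taken proportional to the distribution function $F(t)$ and the inherited hazard proportional to the density $f(t)$, both scaled by $\nu$.

```python
import math


def erlang_pdf(t, shape, rate):
    """Density f(t) of an Erlang distribution with integer shape."""
    if t < 0:
        return 0.0
    return (rate ** shape * t ** (shape - 1) * math.exp(-rate * t)
            / math.factorial(shape - 1))


def erlang_cdf(t, shape, rate):
    """Distribution function F(t) of the same Erlang distribution."""
    if t <= 0:
        return 0.0
    tail = sum((rate * t) ** n / math.factorial(n) for n in range(shape))
    return 1.0 - math.exp(-rate * t) * tail


def spontaneous_hazard(t, nu, shape, rate):
    """Hazard nu * F(t): constant initiation rate, nondecreasing in t."""
    return nu * erlang_cdf(t, shape, rate)


def inherited_hazard(t, nu, shape, rate):
    """Hazard nu * f(t): all lesions present at birth, unimodal in t."""
    return nu * erlang_pdf(t, shape, rate)


# Hypothetical values: mean promotion time SHAPE / RATE = 40 years.
NU, SHAPE, RATE = 0.001, 4, 0.1

for age in range(0, 101, 10):
    print(age,
          spontaneous_hazard(age, NU, SHAPE, RATE),
          inherited_hazard(age, NU, SHAPE, RATE))
```

Under these assumptions the spontaneous hazard rises toward the plateau $\nu$, while the inherited hazard peaks at the mode of the promotion-time density, $(\text{SHAPE} - 1)/\text{RATE} = 30$ years, and then declines.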

Even though a thorough study of the hazard function may lead to new insight into
the process of carcinogenesis, few if any population-based cohorts have been analyzed
to determine the hazard function for cancer incidence. In addition, time-dependent
variation in environmental risk factors for cancer may cause estimates from a
cross-sectional study to be misleading. In this paper the age-specific hazard functions
for both breast and colorectal cancer incidence are estimated using data from the
Utah Cancer Registry and the Utah Population Data Base. We see that the hazard
function for both these types of cancer appears to be increasing monotonically, at
least through age 85 or 90. In the context of the above mechanistic models of
carcinogenesis, we will see that risks for both these cancers at the population level
appear to be relatively homogeneous, with negligible inherited component.

2.  Methods 

2.1  Data 

The data for this study were obtained by linking records from the Utah Population
Data Base (UPDB) with the Utah Cancer Registry (UCR). The UPDB consists of
the genealogical records of more than 1,000,000 individuals who were born, died, or
married in Utah, or en route to Utah, during the nineteenth and twentieth centuries.
Since 1973 the UCR has been reporting to the National Cancer Institute's Surveillance,
Epidemiology, and End Results (SEER) program, and is required to maintain very
high standards for case reporting and follow-up, and to periodically undergo quality
control audits by SEER personnel to assure uniformly high quality and consistency
from year to year. The available follow-up information comes either from Utah death
certificates, which have been linked to the UPDB genealogical data every year from
1933 through the beginning of 1997, or from linkage of the HCFA beneficiary data
to the UPDB. The study population consisted of 126,141 men and 122,208 women
recorded in the Utah Population Database, who were born from 1874 to 1931 and for
whom follow-up information is available that places them in Utah during the years
of operation of the Utah Cancer Registry (1966-present). Subjects with purported
follow-up past age 105 were excluded from the data. There are 5,372 cases of female
breast cancer and 5,177 cases of colorectal cancer represented in the data. Analyses
were performed on subcohorts based on birth year (1874-1889, 1890-1899, 1900-1909,
1910-1919, and 1920-1931) and gender. For each gender the entire cohort (birth years
1874-1931) was also analyzed as a whole. The total number of subjects and cases of
breast and colorectal cancer for each birth subcohort and gender are given in Tables
1 and 2. Male breast cancer was not analyzed.

2.2  Truncation:  Nonparametric  Estimation 

We wish to estimate the age-specific hazard function for breast and colorectal cancer
from the data described above, taking into account that the data are subject to random
truncation: cases which occurred during or before 1965 are not recorded in the
dataset. Subjects were between the ages of 34 and 86 at the time of truncation.
Thus, analysis of the data must take into account not only the effects of right
censoring, but also the effects of left truncation due to delayed entry into the risk
set. The topic of random truncation is not mentioned in several authoritative texts
such as Kalbfleisch and Prentice [19] and Fleming and Harrington [20], and may be
unfamiliar to some readers. It is therefore discussed in this and the following
subsection.

Let the truncation time $Y$ have distribution function $G(y)$ and the failure time
(time of cancer diagnosis) $X$ have distribution function $F(x)$. We require that
truncation be independent of failure and for simplicity assume no censoring for the present.
Observations are conditional on $X > Y$. Let $G^*(y)$ and $F^*(x)$ be the corresponding
distribution functions, conditional on $X > Y$. Let $S(x) = 1 - F(x)$ be the survivor
function of $X$. Suppose that we have observations $(Y_1^*, X_1^*), \ldots, (Y_n^*, X_n^*)$ from the
conditional distribution. The full likelihood of the observed data is given by

$$L = \prod_{i=1}^{n} \left[ dF(X_i^*)\, dG(Y_i^*) / \alpha \right], \tag{4}$$

where $\alpha = \iint_{y < x} dF(x)\, dG(y)$. A key observation is that if $X$ and $Y$ are independent,
then the hazard of $X$ at $x$, given $X > Y = y$ with $x > y$, is equal to the hazard of $X$ at $x$
[21,22]. This observation leads to the result, first mentioned by Kaplan and Meier
[23], that if the distribution $G$ is allowed to vary freely, the natural generalization
of the product limit estimator, given by the formula

$$\hat{S}(t) = \prod_{i:\, X_i^* \le t} \left[ 1 - \frac{1}{R(X_i^*)} \right], \tag{5}$$

where $R(u) = \#\{i : Y_i^* \le u \le X_i^*\}$ is the number at risk at $u$, is the nonparametric
maximum likelihood estimator (NPMLE) of the survivor function $S(t)$ of $X$ (see, for
example [21,22,24]).
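
A minimal Python sketch of this truncated product limit estimator is given below. It assumes no ties and no censoring, as in the text; the function names and the four (truncation age, failure age) pairs are made up for illustration.

```python
def risk_set_size(u, pairs):
    """Number at risk at u: subjects with truncation time <= u <= failure time."""
    return sum(1 for y, x in pairs if y <= u <= x)


def truncated_product_limit(t, pairs):
    """Product limit estimate of S(t) from left-truncated pairs (y, x)."""
    estimate = 1.0
    for _, x in sorted(pairs, key=lambda pair: pair[1]):
        if x <= t:
            # Each observed failure removes a fraction 1/R(x) of the risk set.
            estimate *= 1.0 - 1.0 / risk_set_size(x, pairs)
    return estimate


# Hypothetical (truncation time, failure time) pairs, all with y < x.
PAIRS = [(0.0, 2.0), (1.0, 3.0), (0.0, 5.0), (4.0, 6.0)]

for t in (1.0, 2.0, 3.0, 5.0, 6.0):
    print(t, truncated_product_limit(t, PAIRS))
```

Unlike the ordinary Kaplan-Meier estimator, the risk set here can grow as well as shrink with age, because subjects enter it only at their truncation time. With few early entrants the risk set may be very small at young ages, which makes the estimate unstable there.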

This result extends naturally to the case with random independent censoring
[24]. It also easily follows that in the nonparametric setting (again with no censoring),
maximizing (4) is equivalent to maximizing the conditional likelihood of $(X_1^*, \ldots, X_n^*)$
given $(Y_1^*, \ldots, Y_n^*)$, which can be written

$$CL = \prod_{i=1}^{n} f(X_i^*) / S(Y_i^*) \tag{6}$$

(see, for example, [23-26]). Maximizing the conditional likelihood also leads to the
familiar Nelson-Aalen estimator for the integrated hazard function $H(t)$ of $X$ [24],
which is given by

$$\hat{H}(t) = \sum_{X_i^* \le t} R(X_i^*)^{-1}. \tag{7}$$

These results can be extended to the case of right censoring [24].
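
The extension to right censoring can be sketched in Python as follows. Each record carries an entry (truncation) age, an exit age, and an event indicator; only observed failures contribute increments, while censored subjects still count in the risk set until they exit. The record layout and the example values are assumptions made for this illustration.

```python
def nelson_aalen(t, records):
    """Nelson-Aalen estimate of the integrated hazard at t.

    records: iterable of (entry_age, exit_age, event) with event = 1 for an
    observed failure and 0 for a right-censored observation.
    """
    records = list(records)
    total = 0.0
    for _, exit_age, event in records:
        if event and exit_age <= t:
            at_risk = sum(1 for y, x, _ in records if y <= exit_age <= x)
            total += 1.0 / at_risk
    return total


# Hypothetical records: the second subject is censored at age 3.
RECORDS = [(0.0, 2.0, 1), (1.0, 3.0, 0), (0.0, 5.0, 1), (4.0, 6.0, 1)]

for t in (2.0, 3.0, 5.0, 6.0):
    print(t, nelson_aalen(t, RECORDS))
```

A smoothed hazard estimate can be obtained from the increments $1/R(X_i^*)$ by kernel smoothing; that step is not shown here.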

2.3  Truncation:  Parametric  Models 

We consider the situation where $X$ and $Y$ are independent, $F(x)$ is parametrized,
while $G(y)$ is allowed to vary freely. In a later subsection $F(x)$ will come from a
quadratic spline model.

The data are independent pairs $(y_1, x_1), \ldots, (y_n, x_n)$ from the joint distribution
of $(Y, X)$, conditional on $Y < X$. We suppose, for simplicity, that there are no ties
among $y_1, y_2, \ldots, y_n$, and suppose $X$ has an absolutely continuous distribution function
coming from a family $F(x; z)$ parameterized by a vector $z$, with corresponding
survival function $S(x; z) = 1 - F(x; z)$ and density $f(x; z)$. The NPMLE for $G$ should
consist of (unknown) point masses $q_1, q_2, \ldots, q_n$ placed at the points $y_1, y_2, \ldots, y_n$.
The logarithm of the complete likelihood (4) can be rewritten

$$\log(L) = \sum_{i=1}^{n} [\log(f(x_i; z)) + \log(q_i)] - n \log \sum_{j=1}^{n} S(y_j; z)\, q_j. \tag{8}$$

If we factor out the part of the likelihood corresponding to (6), the logarithm is
given by

$$\log(CL) = \sum_{i=1}^{n} [\log(f(x_i; z)) - \log(S(y_i; z))]. \tag{9}$$

We now discuss the changes which must be made when censoring and additional
covariates are present. If $s$ is a vector of additional covariates, $\lambda(x, s; z)$ denotes the
hazard associated with $F(x, s; z)$ and $\Lambda(x, s; z)$ the cumulative hazard, we note that
(9) becomes

$$\log(CL) = \sum_{i=1}^{n} [\log(\lambda(x_i, s_i; z)) - (\Lambda(x_i, s_i; z) - \Lambda(y_i, s_i; z))]. \tag{10}$$

In the presence of right censoring which is independent of both the failure and
truncation times, $x_i$ is replaced in the above formulation by the minimum of the
failure and censoring times. The term $f(x_i, s_i; z)$ in the likelihood is replaced by
$f(x_i, s_i; z)^{\delta_i} S(x_i, s_i; z)^{1 - \delta_i}$, where $\delta_i = 1$ if observation $i$ is a failure and $\delta_i = 0$ otherwise,
and the conditional likelihood (6) (with $x_i$, $s_i$ and $y_i$ regarded as fixed) becomes

$$CL = \prod_{i=1}^{n} [f(x_i, s_i; z)^{\delta_i} S(x_i, s_i; z)^{1 - \delta_i}] / S(y_i, s_i; z).$$

In this setting $\log(CL)$ becomes

$$\log(CL) = \sum_{i=1}^{n} [\delta_i \log(\lambda(x_i, s_i; z)) - (\Lambda(x_i, s_i; z) - \Lambda(y_i, s_i; z))]. \tag{11}$$

In the subsequent analysis we choose to maximize (11) rather than the full likelihood.
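
The conditional log-likelihood (11) is straightforward to evaluate once the hazard and cumulative hazard are available. The Python sketch below omits covariates for brevity and uses a constant hazard as a stand-in model, because its maximizer has a closed form (number of failures divided by total time at risk) against which the code can be checked. The records and function names are hypothetical.

```python
import math


def conditional_loglik(records, hazard, cum_hazard):
    """Conditional log-likelihood (11) without covariates.

    records: iterable of (y, x, delta), where y is the truncation age, x the
    failure or censoring age, and delta = 1 for a failure, 0 for censoring.
    """
    total = 0.0
    for y, x, delta in records:
        if delta:
            total += math.log(hazard(x))
        # Exposure is accumulated only from entry age y to exit age x.
        total -= cum_hazard(x) - cum_hazard(y)
    return total


# Hypothetical records: (entry age, exit age, failure indicator).
RECORDS = [(34.0, 60.0, 1), (40.0, 75.0, 0), (50.0, 82.0, 1), (36.0, 70.0, 0)]


def constant_hazard_loglik(rate):
    """Log-likelihood (11) for the stand-in model lambda(t) = rate."""
    return conditional_loglik(RECORDS, lambda t: rate, lambda t: rate * t)


failures = sum(delta for _, _, delta in RECORDS)
exposure = sum(x - y for y, x, _ in RECORDS)
rate_hat = failures / exposure  # closed-form maximizer for a constant hazard
print(rate_hat, constant_hazard_loglik(rate_hat))
```

Replacing the two lambda functions with a parametric hazard and its integral gives the objective that a general-purpose optimizer would maximize.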

2.4  Spline  Models 

We choose to model the hazard via quadratic splines as in [27]. A quadratic spline
with $m$ knots specifies the hazard to be of the form

$$\lambda_m(t) = \sum_{i=0}^{2} p_i t^i + \sum_{j=1}^{m} q_j (t - \tau_j)_+^2, \tag{12}$$

where $(x)_+ = \max(x, 0)$. For each birth cohort, we fit splines with knots which were
equally spaced in the interior of the interval $[T_{\min}, T_{\max}]$, where $T_{\min}$ is the minimum
truncation age in the cohort and $T_{\max}$ the maximum follow-up (failure or censoring)
time. Restrictions were placed on the coefficients to ensure that $\lambda_m(t)$ remained
positive for all $t$. Thus with $m$ knots the number of parameters was $m + 3$. Models
were fit using maximum likelihood techniques applied to the conditional likelihood,
as given by (11).
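
A quadratic spline hazard of this kind has a closed-form cumulative hazard, which is what the conditional likelihood needs. The Python sketch below assumes the usual truncated power basis (a quadratic polynomial plus one squared truncated term per knot, giving $m + 3$ coefficients); the coefficient names `poly` and `trunc` and all numerical values are hypothetical, not fitted values.

```python
def interior_knots(t_min, t_max, m):
    """m knots equally spaced in the interior of [t_min, t_max]."""
    step = (t_max - t_min) / (m + 1)
    return [t_min + j * step for j in range(1, m + 1)]


def spline_hazard(t, poly, trunc, knots):
    """Quadratic spline: sum_i poly[i] * t^i + sum_j trunc[j] * (t - knot_j)_+^2."""
    value = sum(p * t ** i for i, p in enumerate(poly))
    value += sum(q * max(t - tau, 0.0) ** 2 for q, tau in zip(trunc, knots))
    return value


def spline_cum_hazard(t, poly, trunc, knots):
    """Integral of spline_hazard from 0 to t, term by term."""
    value = sum(p * t ** (i + 1) / (i + 1) for i, p in enumerate(poly))
    value += sum(q * max(t - tau, 0.0) ** 3 / 3.0 for q, tau in zip(trunc, knots))
    return value


def positive_on_grid(poly, trunc, knots, t_min, t_max, n=200):
    """Crude positivity check of the hazard on an equally spaced grid."""
    grid = [t_min + k * (t_max - t_min) / n for k in range(n + 1)]
    return all(spline_hazard(t, poly, trunc, knots) > 0.0 for t in grid)


# Hypothetical two-knot example on the age range 34 to 100.
KNOTS = interior_knots(34.0, 100.0, 2)
POLY = [1.0e-4, 0.0, 1.0e-7]   # three polynomial coefficients
TRUNC = [2.0e-7, -1.0e-7]      # one coefficient per knot

print(KNOTS)
print(spline_hazard(70.0, POLY, TRUNC, KNOTS))
print(positive_on_grid(POLY, TRUNC, KNOTS, 34.0, 100.0))
```

Only differences of the cumulative hazard between entry and exit ages enter (11), so the lower limit of integration is immaterial. The grid check is a simple stand-in for the coefficient restrictions described in the text, whose exact form is not given here.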

The hazard function was estimated for breast cancer incidence (women only) and
for colorectal cancer incidence (both men and women). The spline estimates were
computed by maximizing $\log(CL)$ using the algorithm of Powell [28]. We started
with one knot and increased the number of knots until the fit was not improved, as
determined by the likelihood ratio test at the significance level $\alpha = 0.05$. Two other
subcohort estimates of the hazard function were computed for comparison with the
spline estimator: a life table version of (5), and a Gaussian kernel estimate based on
the Nelson-Aalen estimator (7).
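
The knot selection rule can be sketched in Python as follows. The sketch assumes that adding one knot adds exactly one parameter and treats successive models as nested, so that twice the gain in maximized log-likelihood is referred to a chi-square distribution with one degree of freedom. The log-likelihood values in the example are invented.

```python
import math


def chi2_sf_1df(x):
    """Upper tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


def extra_knot_improves(loglik_small, loglik_large, level=0.05):
    """Likelihood ratio test for one additional knot (one extra parameter)."""
    statistic = 2.0 * (loglik_large - loglik_small)
    return statistic > 0.0 and chi2_sf_1df(statistic) < level


def select_number_of_knots(logliks, level=0.05):
    """logliks[m - 1] is the maximized log(CL) for the spline with m knots."""
    m = 1
    while m < len(logliks) and extra_knot_improves(logliks[m - 1], logliks[m], level):
        m += 1
    return m


# Hypothetical maximized log-likelihoods for 1, 2, and 3 knots.
LOGLIKS = [-1000.0, -995.0, -994.5]
print(select_number_of_knots(LOGLIKS))
```

Because the knots are equally spaced, the model with $m + 1$ knots does not strictly contain the model with $m$ knots, so the chi-square reference distribution is best read as an approximation.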

2.5  Proportional  Hazards 

It became clear when fitting models to the subcohorts that there was a birth cohort
effect in the data. At the same time, we wished to have estimates of the hazard for
the entire age range of 34-100+ years. We therefore fit proportional hazards models
with splines $\lambda_m(t)$ for the baseline hazard and a single covariate $s$ representing birth
year. The resulting hazard function has the form

$$\lambda_m(t, s; \beta) = \exp(\beta s)\, \lambda_m(t). \tag{13}$$

The model was again fit using the conditional likelihood of the form

$$\log(CL) = \sum_{i=1}^{n} [\delta_i \log(\lambda_m(x_i, s_i; \beta)) - (\Lambda_m(x_i, s_i; \beta) - \Lambda_m(y_i, s_i; \beta))], \tag{14}$$

which is (11) with $\lambda(x, s; z) = \lambda_m(x, s; \beta)$.
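
Under model (13) the ratio of hazards for two birth years depends only on the difference in birth year, not on age. The short Python sketch below illustrates this with a hypothetical coefficient and a hypothetical linear baseline hazard; neither is a fitted quantity from this report, and the centering of birth year at 1900 is an assumption made to keep the exponent small.

```python
import math


def baseline_hazard(t):
    """Hypothetical baseline hazard (per year) at age t; not a fitted spline."""
    return 1.0e-5 * t


def ph_hazard(t, s, beta, baseline=baseline_hazard):
    """Proportional hazards form (13): exp(beta * s) * baseline(t)."""
    return math.exp(beta * s) * baseline(t)


def hazard_ratio(s1, s0, beta):
    """Hazard ratio between covariate values s1 and s0 under (13)."""
    return math.exp(beta * (s1 - s0))


BETA = 0.03                     # hypothetical coefficient, per year of birth
S_EARLY, S_LATE = -20.0, 20.0   # birth year minus 1900 (assumed centering)

for age in (50.0, 70.0, 90.0):
    ratio = ph_hazard(age, S_LATE, BETA) / ph_hazard(age, S_EARLY, BETA)
    print(age, ratio, hazard_ratio(S_LATE, S_EARLY, BETA))
```

Centering the covariate changes only the scale of the baseline hazard; the coefficient and all hazard ratios are unaffected.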

3.  Results 

Estimates of the age-specific hazard for female breast cancer are presented in
Figure 1 for the 1874-1889, 1890-1899, 1900-1909, 1910-1919, and 1920-1931 birth
subcohorts. Age-specific hazards for colorectal cancer are presented in Figures 2
and 3, stratified by birth cohort and gender. Each figure presents three estimates
of the hazard from the subcohort alone, namely the life table estimate, the kernel
estimate based on the Nelson-Aalen estimator and a spline estimate, as well as
one gender-specific estimate from a proportional hazards model with birth year as
covariate, fit to data from all birth subcohorts (1874-1931). The covariate is set
to the mean birth year of the subcohort. We note that approximately 40 years of
follow-up are available for any one subcohort, as follow-up data are available from
approximately 1965-1995.

We found that splines with very few knots appeared to fit the data. In all but one
case two knots were sufficient for the spline estimates, as determined by the likelihood
ratio test, and in the remaining case (breast cancer, birth years 1874-1889) one
knot sufficed. The hazard function for both breast and colorectal cancer appears to
increase monotonically, at least until the age of 85 or 90, when the subcohort-specific
estimates of the hazard for women for both breast and colon cancer appear
to flatten or decrease while the estimate for men appears to continue to increase.
(In each of the three cases the proportional hazards model provides estimates of
the hazard function which increase through all ages.) We also note that in all the
proportional hazards models the birth cohort effect was highly significant (p < 0.0001).
We also see from the subcohort analysis that the proportional hazards assumption
appears to be adequate, at least up until the age of 85 or 90, when proportionality
may fail for women.

We also note that the colorectal cancer risk estimates are higher for men than
for women. For example, the estimated age-specific yearly hazard for the 1920-1931
birth cohort at age 70 is approximately 0.0013 for women, and about 0.0017 for men,
or about 30% higher for men.

The estimated hazards from the proportional hazards models over a 70 year range
are presented in Figures 4-6. The estimated hazards increase as the birth cohorts
become more recent, with coefficient estimates of $\beta = 0.0347$ (year$^{-1}$) for female
breast cancer, $\beta = 0.016$ (year$^{-1}$) for female colorectal cancer and $\beta = 0.020$ (year$^{-1}$)
for male colorectal cancer. Thus, the additional hazard for more recent birth cohorts
appears to be more pronounced for breast cancer than for colorectal cancer.

4. Discussion


As noted in the Introduction, the presence of a large degree of heterogeneity in the
risk for a population may lead to a decreasing age-specific hazard function. Since we
see little or no evidence of a decreasing hazard for either breast or colorectal cancer
at least until age 85 or 90, it appears that the risk is relatively homogeneous for both
these cancers over this age range. In particular, there appears to be little evidence
for a high immune fraction for either breast or colorectal cancer. We should also
note that the presence of a monotone increasing hazard over a limited range does
not completely rule out heterogeneity. The data are quite consistent with the degree
of heterogeneity that might result from known cancer genes, as long as the risk is
generally increasing (at least through age 90) in the population as a whole. There
is little or no evidence of an inherited component to the risk, as a large inherited
component might be expected to produce a local maximum in the hazard rather early
in life, certainly prior to age 85.

One may extend the more general two-stage model of carcinogenesis presented in
the Introduction to take cell death into account, by adding a Poisson process of cell
death which competes with the process of malignant transformation, as suggested by
Yakovlev and Polig [29]. This model has been successfully applied to data from
radiation-induced and chemically induced lesions [30-32]. With the cell death component
it becomes less clear that the hazard function should increase monotonically in the
case of spontaneous carcinogenesis. In fact, in the simplified case of constant rates
$\nu_1$ of initiation and $\nu_2$ of cell death, and arbitrary cumulative distribution function
$F(t)$ for time to transformation of intermediate lesions, the hazard function for time
to tumor has the form

$$\lambda(t) = \nu_1 \exp(-\nu_2 t) F(t). \tag{15}$$

We note that in this model the clock for cell death starts
at birth. If the constant $\nu_2 > 0$ in (15), then $\lambda(t)$ must eventually decrease exponentially, since
$F(t)$ approaches one as $t$ approaches infinity. We conjecture that in the present
context the cell death component is very small, so that it does not dominate $\lambda(t)$
until after age 85. The higher hazard rate for male colorectal cancer, as well as the
continued increase in hazard through age 105, may be attributed to a smaller rate
of cell death. Another possibility is that the cell death should not be measured from
birth, but from formation of the initiated cell (as in another variation of the model
suggested in [29]).
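
The effect of the cell death rate on the shape of (15) can be explored with a small Python sketch. It assumes an Erlang (integer-shape gamma) distribution for the time to transformation and uses invented values for all rates, so the ages it prints are illustrative only.

```python
import math


def erlang_cdf(t, shape, rate):
    """Distribution function of an Erlang distribution with integer shape."""
    if t <= 0:
        return 0.0
    tail = sum((rate * t) ** n / math.factorial(n) for n in range(shape))
    return 1.0 - math.exp(-rate * t) * tail


def cell_death_hazard(t, nu1, nu2, shape, rate):
    """Hazard (15): nu1 * exp(-nu2 * t) * F(t), with an assumed Erlang F."""
    return nu1 * math.exp(-nu2 * t) * erlang_cdf(t, shape, rate)


def peak_age(nu1, nu2, shape, rate, max_age=300):
    """Integer age in [0, max_age] at which the hazard is largest."""
    return max(range(max_age + 1),
               key=lambda t: cell_death_hazard(t, nu1, nu2, shape, rate))


# Hypothetical values: mean transformation time SHAPE / RATE = 40 years.
NU1, SHAPE, RATE = 0.002, 4, 0.1

for nu2 in (0.0, 0.001, 0.01):
    print(nu2, peak_age(NU1, nu2, SHAPE, RATE))
```

With no cell death the hazard keeps rising over the whole grid, and the smaller the cell death rate, the later the hazard turns over. This is in line with the conjecture that a very small cell death component would not be visible before advanced ages.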

We  have  noted  in  the  Results  section  that  proportionality  of  hazard  appears  to 
fail  after  age  90  for  both  breast  and  colorectal  cancer  in  women.  This  result  may  be 
due  to  sampling  variability,  or  additional  bias  unique  to  women  at  these  high  ages. 
We  note  that  there  are  only  116  female  breast  cancer  cases  and  77  female  colorectal 
cancer  cases  after  age  90.  They  are  distributed  over  a  15  year  period,  for  an  average 
of  7.7  breast  cancer  and  5.1  colorectal  cancer  cases  per  year  in  this  range.  In  addition, 
data  linkage  is  more  difficult  for  women,  who  are  more  likely  to  have  changed  names 
than  men.  An  additional  indication  that  the  lack  of  proportionality  for  women  may 
be spurious is that we do not see this apparent lack of proportionality in men.

Legends to figures

Figure 1. Four estimates of the age-specific hazard function for female breast cancer,
stratified by birth cohort: a spline estimate (labeled "Spline"), a kernel estimate
based on the Nelson-Aalen estimator (labeled "Kernel"), a life table estimate
(labeled "Life Table"), and a proportional hazards spline estimate using all strata, with
birth year as sole covariate, set at the stratum mean (labeled "Combined Spline").

Figure 2. Four estimates of the age-specific hazard function for female colorectal
cancer, stratified by birth cohort: a spline estimate (labeled "Spline"), a kernel estimate
based on the Nelson-Aalen estimator (labeled "Kernel"), a life table estimate
(labeled "Life Table"), and a proportional hazards spline estimate using all strata, with
birth year as sole covariate, set at the stratum mean (labeled "Combined Spline").

Figure 3. Four estimates of the age-specific hazard function for male colorectal
cancer, stratified by birth cohort: a spline estimate (labeled "Spline"), a kernel estimate
based on the Nelson-Aalen estimator (labeled "Kernel"), a life table estimate
(labeled "Life Table"), and a proportional hazards spline estimate using all strata, with
birth year as sole covariate, set at the stratum mean (labeled "Combined Spline").

Figure 4. Comparison of the age-specific hazard function estimates for female breast
cancer for various birth cohort strata from a proportional hazards spline model.
Birth year covariate set at the mean value for each stratum: 1884.41 for the
1874-1889 stratum, 1894.90 for the 1890-1899 stratum, 1904.54 for the 1900-1909 stratum,
1914.52 for the 1910-1919 stratum, and 1925.24 for the 1920-1931 stratum.

Figure 5. Comparison of the age-specific hazard function estimates for female
colorectal cancer for various birth cohort strata from a proportional hazards
spline model. Birth year covariate set at the mean value in each stratum: 1884.41
for the 1874-1889 stratum, 1894.90 for the 1890-1899 stratum, 1904.54 for the
1900-1909 stratum, 1914.52 for the 1910-1919 stratum, and 1925.24 for the 1920-1931
stratum.

Figure 6. Comparison of the age-specific hazard function estimates for male
colorectal cancer for various birth cohort strata from a proportional hazards spline
model. Birth year covariate set at the mean value in each stratum: 1884.74 for the
1874-1889 stratum, 1895.06 for the 1890-1899 stratum, 1904.74 for the 1900-1909
stratum, 1914.57 for the 1910-1919 stratum, and 1925.31 for the 1920-1931
stratum.

Table 1. Number of female subjects and cases of breast and colorectal cancer,
stratified by birth year.


Table 2. Number of male subjects and cases of colorectal cancer, stratified by birth
year.

Birth Years    Number of Subjects    No. of colorectal cancer cases
1874-1889                   6,850                               101
1890-1899                  16,307                               341
1900-1909                  27,122                               768
1910-1919                  34,731                               874
1920-1931                  41,131                               587
Total                     126,141                             2,671

ABSTRACT


BACKGROUND: Several measures of familial disease aggregation have been
proposed, but only a few of these are designed to be implemented at the individual level.
We evaluate four of them in the context of breast cancer incidence.

METHOD: A population-based cohort consisting of 114,429 women born between
1874 and 1931 and at risk for breast cancer after 1965 was identified by linking
the Utah Population Data Base and the Utah Cancer Registry. Three competing
methods were used to obtain predictors of familial aggregation of risk: the number of
first degree relatives with breast cancer, the posterior probability of carrying BRCA1
or BRCA2, and the Familial Standardized Incidence Ratio (FSIR), which weights
the disease status of relatives based on their degree of relatedness with the proband.
Spline regression methods were used to estimate the hazard function, stratified by
measures of familial aggregation.

RESULTS: When the measures of family history are dichotomized with
approximately 8.5% of subjects in the high risk category, presence of a first degree relative
and FSIR perform equally well at determining individual risk, with the high risk
category having approximately twice the risk at all ages. The posterior probability
of BRCA1 and BRCA2 performed less well. When FSIR is further stratified, the top
0.1% have an approximately 4-fold increase in risk. The risk appears to be increasing
through all age groups.

CONCLUSIONS:  Family  history  is  a  highly  significant  indicator  of  risk  for  breast 
cancer. 

KEYWORDS: familial risk, hazard function, truncation, survival analysis, breast
cancer

Introduction


Heterogeneity in a population may lead to population estimates of the hazard that
do not reflect individual risk. For example, if we let $\lambda(t)$ denote the hazard function,
and $p$ the probability of immunity to a particular disease, it follows from the formula

$$p = \lim_{t \to \infty} \exp\left\{ -\int_0^t \lambda(u)\,du \right\}$$

that there are individuals who are "immune" in the population exactly when the
hazard function has finite integral. In particular, $\lim_{t \to \infty} \lambda(t) = 0$, provided the
limit exists. More generally, a large degree of heterogeneity in disease susceptibility
may lead to a population hazard function with one or more well-defined maxima.
The maxima may correspond to discrete subpopulations with different genetic
predisposition to disease. A maximum may also result from a continuous frailty, as
the surviving population at higher ages may be overrepresented by individuals with
lower risk [1].
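
The link between a finite integrated hazard and an immune fraction can be illustrated with two stand-in hazards whose integrals are known in closed form. In the Python sketch below, the exponentially decaying hazard and the constant hazard, together with their parameter values, are hypothetical examples chosen for the illustration.

```python
import math


def surviving_fraction(cum_hazard, t):
    """exp(-integral of the hazard from 0 to t), given the cumulative hazard."""
    return math.exp(-cum_hazard(t))


A, B = 0.02, 0.05  # hypothetical hazard parameters (per year)


def decaying_cum_hazard(t):
    """Integral of A * exp(-B * u) over [0, t]; bounded above by A / B."""
    return (A / B) * (1.0 - math.exp(-B * t))


def constant_cum_hazard(t):
    """Integral of the constant hazard A over [0, t]; unbounded in t."""
    return A * t


for t in (50.0, 100.0, 1000.0, 10000.0):
    print(t,
          surviving_fraction(decaying_cum_hazard, t),
          surviving_fraction(constant_cum_hazard, t))

print("limiting immune fraction for the decaying hazard:", math.exp(-A / B))
```

The decaying hazard integrates to $A/B$, so the surviving fraction settles at $\exp(-A/B) > 0$, which is the immune fraction $p$. The constant hazard has an infinite integral, so the surviving fraction tends to zero and no one is immune.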

In fact, there is evidence of heterogeneity for most cancers. According to Easton [2],
"All cancer types exhibit familial clustering, suggestive of a significant inherited
component". He goes on to conclude that as of 1994 known cancer genes accounted
for 0.5-1% of all cancer cases, and that this figure would increase as more cancer
genes are discovered. The breast cancer genes BRCA1 and BRCA2 both contribute
to an increased risk of breast cancer. BRCA1 has an estimated allele frequency of
between 0.0002 and 0.001 (95% CI) [3], and accounts for about 3% of diagnosed breast
cancer [4]. The allele frequency of mutations in BRCA2 is estimated at 0.00022 [5].
Vehmanen et al. [6] found that only 21% of breast cancer families were accounted for
by mutations of BRCA1 and BRCA2, providing indirect evidence for the existence
of other, undiscovered breast cancer genes.

In our previous paper [7], linked population-based data from the Utah Cancer
Registry and the Utah Population Data Base were used to estimate the
population-level hazard function for breast and colorectal cancer, stratified by birth cohort. We
found that the hazard functions for both breast and colorectal cancer appeared to be
monotone increasing functions for both genders and all birth cohorts. This contrasts
with the model-based estimates of Moolgavkar et al. [8], who found the hazard function
to sharply decrease starting sometime past the age of 70.
The lack of clear multiple modes in the hazard function made it clear that more
delicate methods would be needed to account for the known heterogeneity of risk.

A number of measures of familial disease aggregation have been used or proposed,
but only a few of these are designed to be implemented at the individual level.
The most common epidemiologic measure of familial risk is an indicator of whether
one or more first-degree relatives has been diagnosed with cancer or some other
disease. Khoury and Flanders [9] have noted that measures of this sort are prone
to bias under a variety of conditions. Nonetheless, it is a widely used and easily
understood measure of familial risk that can easily be ascertained in a clinical setting.
A second category of family history measures, suggested by Kerber [10], is derived from
the complete risk experience of all observable biological relatives, adjusted for the age,
sex, number and degree of the relatives. The total familial risk is summarized as a
familial standardized incidence ratio (FSIR) or a familial rate (FR). FSIR and FR are
less prone to bias and substantially more sensitive than a crude indicator variable, but
require fairly detailed family history data which may rarely be available in a clinical
setting. A third measure, particularly relevant for breast cancer, was introduced by
Parmigiani et al. [11]. Parmigiani et al. estimated the posterior probability that an
individual carried the breast cancer genes BRCA1 and BRCA2 using information on
first and second degree relatives of the subject. The method relies heavily on prior
estimates of risk to carriers, and prior estimates of prevalence of the genes.

In this paper, age-specific estimates of the hazard function for breast cancer
incidence are obtained, stratified by the above measures of family history. It is found
that FSIR and presence of a first degree relative with breast cancer are highly
significant predictors of increased risk, with an identified high risk category having about twice
the risk. The hazard function for breast cancer appears to be increasing as a function
of age in all risk groups.

Hazard  Function  Estimation 

Data 

The  data  for  this  study  were  obtained  by  linking  records  from  the  Utah  Population 
Data  Base  (UPDB)  with  the  Utah  Cancer  Registry  (UCR).  The  UPDB  consists  of 
the genealogical records of more than 1,000,000 individuals who were born, died, or
married  in  Utah,  or  en  route  to  Utah  during  the  nineteenth  and  twentieth  centuries. 
The  available  follow-up  information  comes  either  from  Utah  death  certificates,  which 
have  been  linked  to  the  UPDB  genealogical  data  every  year  from  1933  through  the 
beginning  of  1997,  or  from  linkage  of  the  HCFA  beneficiary  data  to  the  UPDB. 
The  study  population  consists  of  122,208  women  recorded  in  the  Utah  Population 
Database,  who  were  born  from  1874  to  1931  and  for  whom  follow-up  information 
is  available  that  places  them  in  Utah  during  the  years  of  operation  of  the  Utah 
Cancer  Registry  (1966-present).  Subjects  with  purported  follow-up  past  age  105 
were  excluded  from  the  data.  Potential  subjects  who  had  no  relatives  who  were  also 
in  the  risk  set,  and  therefore  for  whom  no  measures  of  familial  aggregation  could 
be computed, were removed from the data. Excluding these two groups removed
7,779 women, leaving a study population of 114,429 women. There
are 5,092 cases of female breast cancer in the data. Only female breast cancer was
analyzed. Additional details on the data are given in Boucher and Kerber.

Nonparametric  Hazard  Estimation 

The data described above are subject to random truncation: cases which occurred
during or before 1965 are not recorded in the dataset. Subjects were between the
ages of 34 and 86 at the time of truncation. Thus, analysis of the data must take
into account not only the effects of right censoring, but also the effects of left
truncation due to delayed entry into the risk set.

Let the truncation time $Y$ have distribution function $G(y)$, let the minimum of the
failure and censoring times be $X$ with distribution function $F(x)$, and let $\delta$ be the
censoring indicator, with $\delta = 1$ signifying a censored observation. We require that
truncation and censoring be independent of failure. Observations are conditional on
$X > Y$. Let $G^*(y)$ and $F^*(x)$ be the corresponding distribution functions, conditional
on $X > Y$. Let $S(x)$ be the survivor function for the failure time distribution.
Suppose that we have observations $(Y_1^*, X_1^*, \delta_1^*), \ldots, (Y_n^*, X_n^*, \delta_n^*)$ from the conditional
distribution, where for simplicity we describe the situation with no tied failure
times. Our nonparametric methods are based on the nonparametric maximum likelihood
estimator (NPMLE)

$$\hat{S}(t) = \prod_{X_i^* \le t,\ \delta_i^* = 0} \left( 1 - \frac{1}{R(X_i^*)} \right), \tag{1}$$

where $R(u) = \#\{i : Y_i^* \le u \le X_i^*\}$ is the number of subjects at risk at $u$, as described,
for example, in Keiding. The Nelson-Aalen estimator of the cumulative hazard is
given by

$$\hat{\Lambda}(t) = \sum_{X_i^* \le t,\ \delta_i^* = 0} R(X_i^*)^{-1}. \tag{2}$$
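As an illustration only, the estimators (1) and (2) can be sketched in a few lines of Python. The function names and the toy data are hypothetical and are not taken from the software used in this study; the sketch assumes no tied failure times, codes a censored observation as 1, and restricts both the product and the sum to uncensored failure times.

```python
def risk_set_size(u, entries, exits):
    """R(u): number of subjects with entry age <= u <= exit age."""
    return sum(1 for y, x in zip(entries, exits) if y <= u <= x)


def product_limit(t, entries, exits, censored):
    """Product-limit estimate of S(t) under left truncation and right censoring."""
    survival = 1.0
    for x, d in zip(exits, censored):
        if x <= t and d == 0:  # d == 0 marks an observed failure
            survival *= 1.0 - 1.0 / risk_set_size(x, entries, exits)
    return survival


def nelson_aalen(t, entries, exits, censored):
    """Nelson-Aalen estimate of the cumulative hazard at t."""
    return sum(
        1.0 / risk_set_size(x, entries, exits)
        for x, d in zip(exits, censored)
        if x <= t and d == 0
    )


# Toy data (ages in years): entry (truncation) age, exit age, censoring flag.
entries = [40, 45, 50, 42]
exits = [60, 55, 70, 65]
censored = [0, 1, 0, 0]
print(product_limit(66, entries, exits, censored))
print(nelson_aalen(66, entries, exits, censored))
```

Because a subject enters the risk set only at her truncation age, a woman who was already 50 in 1965 contributes nothing to the risk sets at younger ages.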


Parametric  Hazard  Estimation 


We again assume that $X$, $Y$ and $\delta$ are as above. We wish to have a parameterization
$F_1(x, s; z)$ of the failure time distribution, with covariate vector $s$ and parameter
vector $z$. We denote the corresponding density by $f_1(x, s; z)$ and the survival function by
$S_1(x, s; z)$. We condition on $X > Y$ and, in analogous fashion to what is done in
the nonparametric setting, maximize the logarithm of the conditional likelihood. Let
$\lambda(x, s; z)$ denote the hazard associated with $F_1(x, s; z)$ and $\Lambda(x, s; z)$ the cumulative
hazard. The log-likelihood, conditional on $X > Y$, becomes

$$\log(CL) = \sum_{i=1}^{n} \left[ (1 - \delta_i) \log \lambda(x_i, s_i; z) - \left( \Lambda(x_i, s_i; z) - \Lambda(y_i, s_i; z) \right) \right]. \tag{3}$$
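The structure of the conditional log-likelihood (3) can be made concrete with a short Python sketch. It is a simplified illustration, not the program used for the analysis: covariates are omitted, the hazard and cumulative hazard are passed in as ordinary functions, and the constant-hazard example and all names are assumptions chosen so that the maximizer is known in closed form.

```python
import math


def conditional_loglik(data, hazard, cum_hazard):
    """Log-likelihood conditional on X > Y for left-truncated, right-censored data.

    data: iterable of (y, x, delta) with entry age y, exit age x and
    delta = 1 for a censored observation, 0 for an observed failure.
    """
    total = 0.0
    for y, x, delta in data:
        if delta == 0:
            total += math.log(hazard(x))           # failure contributes log hazard
        total -= cum_hazard(x) - cum_hazard(y)     # exposure only from entry to exit
    return total


# Hypothetical example: constant hazard r, so the cumulative hazard is r * t.
data = [(40.0, 60.0, 0), (45.0, 55.0, 1), (50.0, 70.0, 0)]


def loglik_constant(r):
    return conditional_loglik(data, lambda t: r, lambda t: r * t)


# Two failures over 50 person-years of observed exposure give r = 0.04.
print(loglik_constant(0.04))
```

For a constant hazard the maximizer is the number of failures divided by the person-time observed after entry, which shows how the subtraction of the cumulative hazard at the truncation age removes unobserved exposure.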

We modeled the hazard via quadratic splines. A quadratic spline with $m$ knots
specifies the hazard to be of the form

$$\lambda_m(t) = \sum_{i=0}^{2} \alpha_i t^i + \sum_{j=1}^{m} \gamma_j (t - \tau_j)_+^2, \tag{4}$$

where $(x)_+ = \max(x, 0)$, the $\tau_j$ are the knots, and the $\alpha_i$ and $\gamma_j$ are the coefficients. For each birth cohort, we fit splines with knots which were
equally spaced in the interior of the interval $[T_{min}, T_{max}]$, where $T_{min}$ is the minimum
truncation age in the cohort and $T_{max}$ the maximum follow-up (failure or censoring)
time. Restrictions were placed on the coefficients to ensure that $\lambda_m(t)$ remained
positive for all $t$. Thus with $m$ knots the number of parameters was $m + 3$. Models
were fit by maximizing the conditional likelihood.
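A minimal Python sketch of a quadratic spline of the form (4) with equally spaced interior knots follows. The argument names and numerical values are hypothetical, and the positivity restrictions on the coefficients are not enforced here.

```python
def equally_spaced_knots(t_min, t_max, m):
    """Return m knots equally spaced in the interior of [t_min, t_max]."""
    step = (t_max - t_min) / (m + 1)
    return [t_min + step * (j + 1) for j in range(m)]


def quadratic_spline(t, poly_coefs, knot_coefs, knots):
    """Evaluate a quadratic polynomial plus truncated quadratic terms at t.

    poly_coefs: three coefficients of 1, t and t**2.
    knot_coefs: one coefficient per knot, so there are m + 3 parameters in all.
    """
    a0, a1, a2 = poly_coefs
    value = a0 + a1 * t + a2 * t * t
    for coef, knot in zip(knot_coefs, knots):
        value += coef * max(t - knot, 0.0) ** 2  # zero to the left of the knot
    return value


knots = equally_spaced_knots(30.0, 90.0, 2)  # two interior knots
print(knots)
print(quadratic_spline(60.0, (1.0, 0.0, 0.0), (0.5, 2.0), knots))
```

Each truncated term is zero to the left of its knot and joins smoothly at the knot, so the fitted hazard has a continuous first derivative.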

We fit proportional hazards models with splines $\lambda_m(t)$ for the baseline hazard and
a covariate vector $s$ with one component for birth year and perhaps one component
for family history. Birth year was shown to be highly significant in our previous
paper and may account for such effects as a decrease in parity and an increase in
the efficacy of detection methods with time. The resulting hazard function has the
form

$$\lambda_m(t, s; \beta) = \lambda_m(t) \exp\left( \sum_{j=1}^{2} \beta_j s_j \right). \tag{5}$$

The model was fit using the conditional likelihood (3) with $\lambda(x, s; z) = \lambda_m(x, s; \beta)$.

The hazard function was estimated for female breast cancer. The spline estimates
were computed by maximizing $\log(CL)$ using the algorithm of Powell. We started
with one knot and increased the number of knots until the fit was not improved, as
determined by the likelihood ratio test at the significance level $\alpha = 0.05$. The life
table estimator based on (2) was used for comparison with the spline-based estimator.
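The forward selection of the number of knots can be sketched as follows. This is an assumption-laden illustration rather than the original code: it supposes that each added knot adds one parameter, so that twice the gain in log-likelihood is referred to a chi-square distribution with one degree of freedom, and the log-likelihood values shown are invented.

```python
import math


def chi2_sf_1df(x):
    """Upper tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


def select_knots(logliks, alpha=0.05):
    """Choose the number of knots by successive likelihood ratio tests.

    logliks[k] is the maximized log-likelihood of the model with k + 1 knots.
    Knots are added while the improvement is significant at level alpha.
    """
    chosen = 1
    for k in range(1, len(logliks)):
        stat = 2.0 * (logliks[k] - logliks[k - 1])
        if stat <= 0.0 or chi2_sf_1df(stat) >= alpha:
            break  # no significant improvement: keep the smaller model
        chosen = k + 1
    return chosen


# Invented log-likelihoods for models with 1, 2, 3 and 4 knots.
print(select_knots([-100.0, -95.0, -94.5, -90.0]))
```

The search stops at the first non-significant step, so a later improvement (as in the last invented value) is never examined; this mirrors the stopping rule described in the text.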

Methods  of  Familial  Aggregation 

Number  of  First  Degree  Relatives 

The simplest and most easily understood measure is the number of first degree relatives
with breast cancer. Of the 114,429 women in the data set, 9,765, or approximately
8.5%, had at least one first degree relative with breast cancer also represented in the
data, and 795 women, or 0.69%, had two or more such relatives in the data. Having more
than two first degree relatives with breast cancer was extremely rare: 56 women had
three, and 10 women had the maximum of four.

Posterior Probability of BRCA1 and BRCA2

We used the method of Parmigiani et al., implemented in the computer program
BRCAPRO, available from the authors, to compute posterior probabilities
of carrying BRCA1 and BRCA2 mutations for each of our subjects. The method
uses age at onset of breast and ovarian cancer for first and second degree relatives
to compute these probabilities. It
incorporates prior distributions for the risk of breast and ovarian cancer to carriers
and noncarriers of the breast cancer genes BRCA1 and BRCA2, as well as prior
estimates of the population-level carrier probabilities. We used the prior
probability distributions suggested by Parmigiani et al.

We were able to compute posterior probabilities for 114,221 (or 99.8%) of the
subjects with a first degree relative in the database. The mean carrier probabilities
were 0.000301 and 0.000098 for BRCA1 and BRCA2 respectively, with medians of
0.000098 and 0.000015. The distributions of the posterior carrier probabilities are
shown in Figures 1 and 2.


Familial  Standardized  Incidence  Ratio 


The second measure of familial aggregation is a modification of the familial standardized
incidence ratio (FSIR) method. The familial standardized incidence ratio is
derived from the complete risk experience of all observable biological relatives,
adjusted for age, sex, number and degree of the relatives. FSIR is defined in terms of
the kinship coefficient $c(i,j)$ between individuals $i$ and $j$, which gives the probability
that two individuals share a gene at a given locus. The kinship coefficient is
defined by $c(i,j) = (1/2) \sum_{p=1}^{P_{i,j}} (1/2)^{l(p)}$, where $P_{i,j}$ is the total number of paths between
individuals $i$ and $j$, and $l(p)$ is the length in reproductive events of each path $p$. Let
$I_j = 1$ if the $j$th member has the disease and 0 otherwise. Finally, we suppose that
we have a stratified population in which the population incidence in the $k$th stratum is given
by $\lambda_k$, and let $t_{jk}$ be the time that the $j$th person spent in the $k$th stratum of risk.
The familial standardized incidence ratio is then defined, for the $i$th individual, by

$$FSIR_i = \frac{\sum_{j} I_j \, c(i,j)}{\sum_{k} \sum_{j} \lambda_k \, t_{jk} \, c(i,j)},$$

where $j$ runs over the relatives of individual $i$ and $k$ over the strata.
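The kinship coefficient and the FSIR ratio can be illustrated with a small Python sketch. The data layout (a list of dictionaries with hypothetical keys `affected`, `kinship` and `time`) and the stratum rates are assumptions made for the example only.

```python
def kinship(path_lengths):
    """Kinship coefficient from the lengths (in reproductive events) of all paths."""
    return 0.5 * sum(0.5 ** length for length in path_lengths)


def fsir(relatives, rates):
    """Kinship-weighted observed cases divided by kinship-weighted expected cases.

    relatives: list of dicts with keys 'affected' (0 or 1), 'kinship' and
    'time' (person-time spent in each stratum, in the same order as rates).
    rates: population incidence rate in each stratum.
    """
    observed = sum(r["affected"] * r["kinship"] for r in relatives)
    expected = sum(
        rate * r["time"][k] * r["kinship"]
        for r in relatives
        for k, rate in enumerate(rates)
    )
    return observed / expected


mother_child = kinship([1])     # one path of length 1
full_sibs = kinship([2, 2])     # two paths of length 2, one through each parent

rates = [0.001, 0.002]          # invented stratum-specific incidence rates
relatives = [
    {"affected": 1, "kinship": mother_child, "time": [20.0, 10.0]},
    {"affected": 0, "kinship": full_sibs, "time": [20.0, 0.0]},
]
print(fsir(relatives, rates))
```

Both a parent-child pair and a pair of full siblings receive a kinship coefficient of 1/4 under this definition, while more distant relatives are weighted down geometrically.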


In deriving a measure of variance $VAR_i$ for $FSIR_i$, it was assumed that the
denominator of the above expression is fixed, and that for each fixed path length the
number  of  observed  cases  follows  a  Poisson  distribution  with  mean  equal  to  the 
expected  number  of  cases  in  the  stratum.  The  population  risk  estimates  used  to 
construct  the  denominator  of  FSIRi  were  assumed  to  be  fixed. 

A difficulty with using the "raw" FSIR scores is that the amount of information
from which a score is constructed for a particular individual is highly variable. A low FSIR
score could be an indicator of low risk or simply reflect small family size. We therefore
chose to adjust the scores using an empirical Bayes procedure before incorporating
them into a regression analysis. As the raw FSIR scores are highly skewed, we first
transformed them using a loglog transform, $\mathrm{loglog}(FSIR) = \log(1 + \log(1 + FSIR))$.
The basic assumption of the empirical Bayes adjustment is that the "true" values $\mu_i$
of $\mathrm{loglog}(FSIR)$ are normally distributed. The mean and variance of the $\mu_i$ are estimated
empirically and iteratively from the data. The procedure we use is similar to the one
suggested by Greenland and Robins.
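For concreteness, the loglog transform and its inverse can be written as below. This is only a sketch; the function names are hypothetical, and the use of `log1p` and `expm1` is a numerical convenience rather than something specified in the text.

```python
import math


def loglog(fsir_value):
    """loglog(FSIR) = log(1 + log(1 + FSIR)), defined for FSIR >= 0."""
    return math.log1p(math.log1p(fsir_value))


def inverse_loglog(y):
    """Map a transformed value back to the FSIR scale."""
    return math.expm1(math.expm1(y))


for value in (0.0, 1.0, 2.0, 6.1):
    print(value, loglog(value), inverse_loglog(loglog(value)))
```

The transform maps an FSIR of zero to zero and compresses the long right tail, which is why it is applied before the normality assumption is invoked.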

More specifically, we suppose that after iteration $n - 1$ we have current estimates
$\mu_{i,n-1}$ and $\sigma^2_{i,n-1}$ for the true value and variance of the $i$th individual, as well as an overall mean $\mu_{n-1}$
and variance $\sigma^2_{n-1}$ for the $\mu_i$. We then computed new estimates using the formula

$$\mu_{i,n} = \mu_{n-1} + \left( \frac{\sigma^2_{n-1}}{\sigma^2_{n-1} + \sigma^2_{i,n-1}} \right) (Y_i - \mu_{n-1}),$$

where $Y_i = \mathrm{loglog}(FSIR_i)$, and with variance estimated by

$$\sigma^2_{i,n-1} = \frac{VAR_i}{\left( \exp(\mu_{i,n-1}) \exp(\exp(\mu_{i,n-1}) - 1) \right)^2},$$

given by the delta method. We then computed the sample mean and variance of the $\mu_{i,n}$
over all the subjects to get $\mu_n$ and $\sigma^2_n$.
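One iteration of the empirical Bayes adjustment might be sketched in Python as follows. The code is an interpretation under stated assumptions, not the original implementation: the individual variance is evaluated by the delta method at the current estimate, the sample variance uses the divisor (number of subjects minus one), and all names and numbers are hypothetical.

```python
import math


def delta_variance(var_raw, mu):
    """Delta-method variance of loglog(FSIR) given the raw-scale variance."""
    scale = math.exp(mu) * math.exp(math.exp(mu) - 1.0)
    return var_raw / scale ** 2


def shrink(y_i, s2_i, mu, sigma2):
    """Pull the observed value y_i toward the overall mean mu."""
    weight = sigma2 / (sigma2 + s2_i)
    return mu + weight * (y_i - mu)


def eb_step(y, var_raw, mu_i, mu, sigma2):
    """One update of the individual estimates and of the overall mean and variance.

    Requires at least two subjects so that the sample variance is defined.
    """
    new_mu_i = [
        shrink(y_k, delta_variance(v_k, m_k), mu, sigma2)
        for y_k, v_k, m_k in zip(y, var_raw, mu_i)
    ]
    count = len(new_mu_i)
    new_mu = sum(new_mu_i) / count
    new_sigma2 = sum((m - new_mu) ** 2 for m in new_mu_i) / (count - 1)
    return new_mu_i, new_mu, new_sigma2


# Invented values: the second subject has a much noisier raw score.
y = [0.2, 1.1, 0.6]
var_raw = [0.05, 4.0, 0.5]
estimates, mean, variance = eb_step(y, var_raw, list(y), 0.6, 0.2)
print(estimates, mean, variance)
```

A subject whose raw score has a large variance receives a small weight and is pulled strongly toward the overall mean, which is the intended correction for small families.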

The distributions of loglog(FSIR), before and after empirical Bayes adjustment, are displayed
in Figure 3. Note that the "raw" distribution is bimodal, with a mode at zero which
disappears after adjustment.

Results 

Dichotomized  Comparison  of  Familial  Risk 

We dichotomized each of our measures of familial risk, with the high risk category
representing approximately 8.5% of the data in each case. This was a natural cut
point, as it represents the proportion of subjects with one or more first degree
relatives with breast cancer. The cutoff for FSIR roughly corresponds to a relative risk
of two to family members. The cut points for the posterior probability of BRCA1
and BRCA2 come at points where the posterior probability is rather small, less than
0.0005 in both cases. The number of subjects in each category and the ranges for
the variables are presented in Table 1.

Our previous analysis indicated that a highly significant birth-year effect exists
in the data, with a woman born ten years later having an estimated 40% increased
age-specific risk. Birth-year was included as an additional covariate in all regression
analyses. The baseline risk was estimated using splines, with the proportional hazards
model used for birth-year and familial risk. As with most of the models, we
found that two knots were sufficient to provide an optimal fit. Separate estimates
of the age-specific hazard for each level of each of our familial risk measures are
presented in Figure 4. For comparison we provided life table estimates of the risk.
The life table estimates are not adjusted for birth-year. The life table estimates
are flatter, and this may be explained by a significant birth-cohort effect. Subjects
contribute to the risk estimates only for a period of at most 33 years of their lives,
namely the period from 1965-1998. A woman born in 1890 contributes only after age
75, while a woman born in 1930 contributes from age 35 until the age of 68.
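The age window arithmetic in this paragraph can be checked with a one-line calculation. The function below is a hypothetical helper that uses the 1965 and 1998 calendar limits stated above; actual follow-up ends earlier at failure or censoring, and purported follow-up past age 105 was excluded.

```python
def contribution_ages(birth_year, start_year=1965, end_year=1998):
    """Earliest and latest ages at which a subject can contribute to the risk set."""
    return start_year - birth_year, end_year - birth_year


for year in (1890, 1930):
    print(year, contribution_ages(year))
```

Because each birth cohort is observed over a different age range, an unadjusted life table estimate mixes cohorts with different baseline risks at different ages.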

The presence of a first degree relative with breast cancer and the dichotomized
FSIR variable appear to be equally effective at distinguishing high risk subjects,
with the high risk category having about double the risk, while the posterior
probabilities of BRCA1 and BRCA2 appear to be less effective.

We performed a more detailed stratified analysis of FSIR. The category boundaries
were the approximate 75th, 90th, and 99.9th percentiles of the (adjusted) FSIR
distribution. The upper category roughly corresponds to the reported fraction of
the general population carrying known breast cancer genes. The number of subjects,
cases, and category boundaries are given in Table 2. Bootstrap confidence bands were
also computed as an indicator of the reliability of the estimates. The estimates of
the age-specific hazard and percentile-based bootstrap confidence intervals are
presented in Figure 5. The bootstrap confidence intervals are based on 100 bootstrap
samples, except for the <75th percentile category, which is based on 20 bootstrap
samples, because of the extensive time it took to fit the models to the large datasets.
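A percentile-based bootstrap interval for a scalar estimate can be sketched as follows. This is a generic illustration with hypothetical names: in the analysis above the resampled quantity is the fitted hazard at each age, whereas here a simple mean stands in for the estimator, and the nearest-rank rule for picking the percentiles is an assumption.

```python
import math
import random


def bootstrap_estimates(data, estimator, n_boot, rng):
    """Recompute the estimator on n_boot samples drawn with replacement."""
    return [
        estimator([rng.choice(data) for _ in data])
        for _ in range(n_boot)
    ]


def percentile_interval(values, level=0.95):
    """Nearest-rank percentile interval from a list of bootstrap estimates."""
    ordered = sorted(values)
    count = len(ordered)
    tail = (1.0 - level) / 2.0
    lower = ordered[int(math.floor(tail * count))]
    upper = ordered[int(math.ceil((1.0 - tail) * count)) - 1]
    return lower, upper


def mean(sample):
    return sum(sample) / len(sample)


rng = random.Random(0)                      # fixed seed for reproducibility
data = [0.8, 1.1, 1.4, 0.9, 2.3, 1.0, 1.7, 1.2]
estimates = bootstrap_estimates(data, mean, 100, rng)
print(percentile_interval(estimates))
```

With only 20 bootstrap samples this rule returns the smallest and largest replicate, so the bands for the largest category are necessarily coarser than those based on 100 samples.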

Regression Methods Incorporating Familial Risk as a Covariate

We incorporated the posterior probabilities of BRCA1 and BRCA2 and their
logarithms, as well as loglog(FSIR), as continuous variables in separate analyses, using
a proportional hazards model with birth-year as an additional covariate. The
log-likelihoods and the values of $\chi^2_1$ are presented in Table 3. We see that the best result
(in terms of statistical significance) is obtained by including loglog(FSIR), where
we get a likelihood ratio $\chi^2_1 = 316.72$ ($p < 0.00001$).

We also considered the indicator variable NFIRST for presence/absence of a first
degree relative with breast cancer, in a proportional hazards model. From Figure 4A it can be seen that
the proportional hazards assumption is not grossly violated. The variable NFIRST
was highly significant (likelihood ratio $\chi^2_1 = 185.6$, $p < 0.0001$). Addition of a second
indicator variable for two or more first degree relatives with breast cancer did not
improve the likelihood significantly (data not shown).

Discussion 

We have applied several methods of measuring familial aggregation at the individual
level to breast cancer data. All prove to be statistically significant predictors of
individual risk. Judging by the difference in risk estimates, as well as the likelihood
ratio test, presence of a first degree relative and FSIR appear to be better indicators
of increased risk than the posterior probability of BRCA1 or BRCA2. Judging solely
by the likelihood ratio test, one would prefer FSIR.

FSIR may be thought of as an extension of the cruder number of first degree relatives
with  breast  cancer,  adjusting  for  the  level  of  relatedness  and  expected  disease.  It  is 
therefore  not  surprising  to  find  that  it  performs  better. 

Although  the  estimates  become  less  reliable  at  increasing  age,  the  hazard  function 
for  breast  cancer  appears  to  be  essentially  non-decreasing  in  all  the  categories  of  all 
familial measures considered. Thus we find no evidence of an "immune fraction" in
this  analysis.  The  curves  for  different  levels  of  risk  appear  not  to  merge  or  cross, 
indicating  that  the  increased  risk  to  those  with  a  family  history  does  not  dissipate 
after  a  certain  age. 

Other investigators have either estimated or simply assumed that the risk of
breast cancer decreases past a certain age. As previously noted, Moolgavkar et al.
found the hazard function to sharply decrease starting sometime past the age of 70.
In their estimates, by age 90 the risk has decreased to about 1/3 of the peak. Parmigiani et al.
fit breast cancer incidence data from Easton et al. to a three parameter gamma
distribution. Implicit in this fitting procedure is the assumption that the risk to
carriers of BRCA1 and BRCA2 decreases to zero with age. There is little actual
evidence for this in the fitted data, as the last age is 70. Although based on sparse
data, our estimates show no evidence for decreased risk to carriers at advanced age.

It may be important for further modeling efforts to better understand the hazards
to carriers of disease susceptibility genes, particularly at more advanced ages, where
data are sparse.

Legends to figures

Figure 1. Distribution of the posterior probability that a subject carries BRCA1.

Figure 2. Distribution of the posterior probability that a subject carries BRCA2.

Figure 3. The distribution of loglog(FSIR) before (A) and after (B) empirical Bayes
adjustment.

Figure 4. Spline and life table estimates of the age-specific hazard for breast cancer,
stratified by number of first degree relatives (A), posterior probability of BRCA1 (B),
posterior probability of BRCA2 (C), and empirical Bayes adjusted FSIR (D). The high
risk category contains about 8.5% of the subjects in each case.

Figure 5. Stratified spline-based estimates and 95% bootstrap confidence bands for
the age-specific hazard function for breast cancer. The categories are percentiles 0-75
(A), 75-90 (B), 90-99.9 (C), and 99.9-100 (D) of the adjusted FSIR distribution. The
scales are different, for better resolution.

Table 1. Number of subjects and range of the risk categories for the dichotomized
familial risk variables. NFIRST refers to the number of first degree relatives with
breast cancer. Pr(BRCA1) and Pr(BRCA2) refer to the posterior probability of
carrying BRCA1 or BRCA2 from the model of Parmigiani et al., and FSIR refers to the
familial standardized incidence ratio.

Risk Variable   Low Risk                 High Risk
                subjects  range          subjects  range
NFIRST          104680
Pr(BRCA1)       104442    0-0.000452     9779      0.000452-0.96
Pr(BRCA2)       104440    0-0.000173     9781      0.000173-0.335
FSIR            104664    0.01-2.0                 2.0-6.1

Table  2.  Stratification  of  FSIR  for  analysis  with  four  categories,  together  with  the 
number  of  cases  per  category. 


Percentile of FSIR   Range      Subjects   Cases (% Cases)
<75                  0.01-1.2   85822      3279 (3.8%)
75-90                1.2-1.7    17165      951 (5.5%)
90-99.9              1.7-4.1    11328      845 (7.5%)
99.9-100             4.1-6.1    114        17 (14.9%)

Table 3. Likelihood ratio statistics for models with posterior probabilities
of BRCA1 and BRCA2 or their logarithms, as well as FSIR; see text for details. The
chi-square value was computed using the likelihood ratio statistic.

Variable           $\chi^2_1$
Pr(BRCA1)          8.52
Log(Pr(BRCA1))     44.94
Pr(BRCA2)          5.52
Log(Pr(BRCA2))     64.32
Loglog(FSIR)       316.72